Decoding the Critique Mechanism in Large Reasoning Models
Hoang Phan 1, Quang H. Nguyen 1, Hung T. Q. Le 1, Xiusi Chen 2, Heng Ji 2, Khoa D. Doan 1

1 College of Engineering and Computer Science, VinUniversity, Vietnam. 2 University of Illinois Urbana-Champaign. Correspondence to: Khoa D. Doan <khoa.dd@vinuni.edu.vn>.

Preprint. March 18, 2026.

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has a sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes into their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain-of-thought (CoT) and producing an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature-space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error-detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at mail-research/lrm-critique-vectors.

1. Introduction

[Figure 1 shows an example question ("Janet's ducks produce 16 eggs each day, she uses 3 for breakfast and 4 for baking, and sells the rest at $2 per egg; how much money does she earn daily at the farmers' market?") with the model's original and intervened reasoning traces.] Figure 1. Hidden self-correction despite erroneous reasoning in R1-32B. Left: original correct reasoning. Right: the injected error (3 + 4 = 6) propagates to an incorrect thinking conclusion ($20), yet the model recovers to the correct final answer ($18). We hypothesize this indicates implicit error detection beyond the observable CoT. Full generation details are in Appendix E.1.

The recent wave of large reasoning models (LRMs)—including the o1 series (OpenAI et al., 2024), DeepSeek-R1 (Guo et al., 2025), and Qwen3 (Yang et al., 2025a)—has substantially raised the performance ceiling on complex, multi-step reasoning benchmarks. Recent studies (Guo et al., 2025; Yang et al.
, 2025d) show that these models exhibit human-like "aha moments", including backtracking and self-verification (e.g., "Wait, I made a mistake!"), resulting in non-linear traces rather than traditional linear CoT and leading to strong performance on complex logical benchmarks. Yet, many "aha moments" and backtracking steps are redundant or illusory, leading to overthinking: models produce excessive tokens, enter self-doubt loops, and often degrade performance due to a lack of critical thinking (Chen et al., 2025d; Fan et al., 2025; Zhao et al., 2025). This suggests that while the model possesses some "critique" ability to detect its own mistakes, this ability is still intrinsically weak, underscoring the need for methods that enable models to reliably detect genuine errors, execute precise self-correction, and eliminate decorative reasoning steps. Throughout this paper, we use the terms critique, self-critique, and self-correction interchangeably to refer to the ability of LRMs to detect and rectify errors.

Recent work on test-time scaling increases inference compute to enhance reasoning performance, rather than scaling to larger parameter counts. For instance, s1 (Muennighoff et al., 2025) shows that minimal fine-tuning with budget forcing, appending "Wait" tokens during inference, yields significant gains, while compute-optimal strategies enable smaller models to outperform larger ones (Snell et al., 2025). However, test-time scaling does not consistently improve performance (Ghosal et al., 2025; Zeng et al., 2025), as its effectiveness depends on the model's intrinsic self-verification capacity (Setlur et al., 2025; Zeng et al., 2025). A key bottleneck is that LLMs struggle to detect errors in long CoTs (He et al., 2025), fail to self-correct without external guidance (Kamoi et al., 2024), and cannot reliably identify their own mistakes (Tyen et al.
, 2024). While training-based methods improve verification capabilities (Weng et al., 2023; Kang et al., 2025; Chen et al., 2025b; Lee et al., 2025c), they incur substantial training costs, architectural complexity, or reliance on external tools, reducing their practicality. We hypothesize that the budget-forcing approach relies heavily on the model's error detection and correction ability, and we investigate whether this ability can be identified and controlled without additional training.

Understanding how LRMs handle errors internally requires samples in which the model makes mistakes and performs self-correction. However, we observe that such instances are rare in standard generation. When the model answers correctly, its CoTs are typically all correct; when it is wrong, errors persist throughout with no clear recovery. Moreover, labeling every step is prohibitively expensive. Instead, we deliberately inject arithmetic errors (e.g., 3 + 4 = 6) at intermediate steps within the reasoning trace and let the model continue without terminating its thinking process (Figure 1). We then compare the thinking conclusions against the ground-truth answers to assess recovery, yielding a scalable analysis framework without per-step supervision. Surprisingly, despite the corrupted reasoning from the injected errors, the model often recovers and produces correct final answers (Figure 2). We call this the hidden critique ability, whereby LRMs silently rectify errors without verbalizing the correction in the chain of thought.

To understand this recovery, we analyze the feature space and observe strong separability between the error-recovery (with intervention) runs and the standard (without intervention) runs, although both yield correct final answers. We hypothesize that this separability captures an internal critique mechanism, i.e., the model's hidden ability to detect and correct errors without explicit verbalization.
This hidden capability can be represented as a single linear direction in the activation space – exactly the difference between the two mean activation vectors of the intervened and standard runs. We call it the critique vector and validate its semantic meaning with logit lens analysis across model families. To further evaluate our hypothesis and test the utility of the critique vector, we use it to steer the LRMs and evaluate performance on several error-detection tasks. Experimental results show that positive steering nudges LRMs toward more critical behavior and improves their mistake-identification performance, while negative steering worsens it. Finally, we show that controlling LRMs with this critique vector during test-time scaling enables the model to detect its mistakes better and produce more correct solutions. In summary, our contributions are as follows:

• We are the first to uncover the hidden self-correction phenomenon in LRMs: injecting a simple arithmetic error into the CoT corrupts the reasoning and intermediate conclusions, yet the model implicitly recovers and produces the correct final response.

• Through feature-space analysis, we show that activations are highly separable between the error-recovery runs and the standard runs. Leveraging this separation, we extract a critique vector and rigorously validate its error-detection capability through logit lens analysis and steering experiments.

• Finally, we demonstrate that the critique vector influences test-time scaling: increasing it helps the model better recognize its mistakes and improves its final accuracy, while decreasing it degrades performance.

2. Related Work

Representation Engineering in LLMs. Representation engineering extracts vectors from model activations to steer behavior without fine-tuning. Early work (Rimsky et al., 2024; Turner et al.
, 2024) computed steering vectors from contrastive prompt pairs, enabling applications in style control (Konen et al., 2024) and contrastive steering (Rimsky et al., 2024), later extended to refusal control (Lee et al., 2025b; Arditi et al., 2024) and improved instruction following (Stolfo et al., 2025). Recent advances address polysemanticity through feature-level steering (Yang et al., 2025b) and apply these techniques to large reasoning models (LRMs), enabling control over thinking budgets (Li et al., 2025), trace editing (Sun et al., 2025), and reasoning calibration (Chen et al., 2025c). Unlike prior work steering style, length, or consistency, we are the first to discover the hidden critique ability and identify a critique vector that represents latent error detection, providing direct mechanistic access to the internal processes underlying successful test-time scaling.

Faithfulness of Chain-of-Thought Reasoning. Recent studies reveal that reasoning traces in LRMs are often unfaithful to the model's actual internal computations. For example, models may fail to verbalize their true reasoning processes (Chen et al., 2025e; Turpin et al., 2023), generate misleading explanations (Arcuschin et al., 2025), or produce decorative "aha moments" that do not reflect genuine problem-solving (Zhao et al., 2025). Our work observes a related phenomenon, where models generate correct final answers despite corrupted reasoning, further evidencing the mismatch between verbalized and internal computation. However, we go beyond observation by locating this capability in latent space, interpreting it as a critique vector, and using it to control error detection and recovery behavior.

Error Detection and Correction. Prior work reveals mixed capabilities in LLM error detection and self-correction.
While self-verification improves reasoning (Weng et al., 2023), fundamental limitations persist: LLMs struggle to detect errors in long CoT traces (He et al., 2025), cannot reliably correct mistakes without external guidance (Kamoi et al., 2024), and only succeed when error locations are provided (Tyen et al., 2024). Recent approaches address these gaps through key-condition verification (Wu et al., 2024), decomposition into confidence and critique components (Yang et al., 2025e), or reinforcement learning (Kumar et al., 2025). In reasoning models, self-verification emerges as a critical ability during training (Guo et al., 2025), yet the underlying mechanisms remain underexplored. Lee et al. (2025a) examined its geometry in task-specific models, while Zhang et al. (2025) labeled reasoning steps using commercial models and then trained probes showing that models implicitly encode correctness information. In contrast, we focus on extracting the model's intrinsic critique ability without external supervision, which enables mechanistic control over error detection across diverse logical reasoning tasks.

Test-Time Scaling. Test-time scaling shifts focus from parameter scaling to allocating more computation during inference. Simple fine-tuning on reasoning traces and appending "Wait" tokens already yields notable gains (Muennighoff et al., 2025), while compute-optimal strategies allow smaller models to outperform much larger ones (Snell et al., 2025). Follow-up work incorporates compute-optimal allocation (Liu et al., 2025), self-verification (Chen et al., 2025b), tool-assisted verification (Kang et al., 2025), and reward models to assess reasoning quality (Wang et al., 2024; Zhang et al., 2024; Lightman et al., 2024). However, recent studies reveal limitations to this potential, as extended thinking alone saturates and fails to scale beyond a certain point (Ghosal et al., 2025).
Additionally, LRMs show limited scaling capabilities without proper verification (Zeng et al., 2025), and scaling without verification or RL remains suboptimal (Setlur et al., 2025). We hypothesize that critique is the core mechanism underlying successful test-time scaling, and we demonstrate this by extracting critique vectors that enable direct control over the scaling process.

3. Uncovering Hidden Self-Correction

3.1. Intervention Framework

LRMs typically operate in two phases: given the prompt q, they first generate a reasoning trace c between a special start-of-think token ⟨think⟩ and an end-of-think token ⟨/think⟩, and then produce the final answer a. Formally:

f_LRM : (q, ⟨think⟩) ↦ (c, ⟨/think⟩, a).   (1)

Previous work has studied model behavior under CoT interventions via prefilling (Yang et al., 2025c; Wu et al., 2025), where a variable content e is inserted immediately after the thinking-start token. This generation process can be expressed as:

f_LRM : (q, ⟨think⟩, e) ↦ (c̃, ⟨/think⟩, ã).   (2)

Here, c̃ represents the model's reasoning continuation after intervention, and the complete CoT is formed by concatenating e + c̃. The intervention e is flexible and task-dependent; for instance, it may be an instruction, a prompt, or another mechanism designed to analyze specific model behaviors.

A key question in our work is whether LRMs can detect and recover from errors during their internal reasoning. To investigate this, we inject an arithmetic error e (e.g., 3 + 4 = 6) into the reasoning chain and examine whether the continuation c̃ recovers from it. We generate this error using GPT-5 with a carefully engineered prompt P_err (detailed in Appendix A.1):

f_GPT-5 : (q, P_err) ↦ e.   (3)

Using this approach, we construct GSM8K-Error and MATH500-Error by generating one error e per question q, creating a consistent benchmark across all models.
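Concretely, the prefilling intervention of Eqns. (1)–(3) amounts to building a decoding prefix that ends with the injected error e, so the model's continuation becomes c̃. The sketch below is only an illustration, not the authors' code: the `<think>` token string and the prompt layout are placeholders, and real chat templates differ per model family.

```python
def build_intervened_prompt(question: str, error_step: str,
                            think_open: str = "<think>") -> str:
    """Prefill the reasoning trace with an injected erroneous step e (Eqn. 2).

    The model is asked to continue generating from this prefix, so its
    complete CoT becomes e + c_tilde. `think_open` is a placeholder token.
    """
    return f"{question}\n{think_open}\n{error_step}\n"


# Injected arithmetic error for the Figure 1 example (3 + 4 should be 7).
prompt = build_intervened_prompt(
    "Janet's ducks produce 16 eggs each day; she uses 3 for breakfast and "
    "4 for baking, and sells the rest at $2 per egg. How much does she earn?",
    "So she uses 3 + 4 = 6 eggs per day.",
)
print(prompt.endswith("6 eggs per day.\n"))  # the erroneous step ends the prefix
```

Because decoding resumes after the erroneous step, the model cannot simply refuse the premise; it must either propagate the mistake or recover from it downstream.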
For evaluation, we extract the thinking process c̃ and the final answer ã, then compare each against the ground truth y_q to determine whether the model can recover from the injected errors. Unlike step-by-step supervision methods (Zhang et al., 2025) that require labeling each reasoning step, our approach is far more efficient and scalable. Full details on the error generation process and correctness evaluation are in Appendix E.1 and Section 5.1, respectively.

3.2. LRMs Hiddenly Recover in the Final Answer

We evaluate the error-recovery performance of four open-source models – DeepSeek-R1 (8B, 14B, 32B) and Qwen3 (4B) – on GSM8K-Error and MATH500-Error. Details of these models are provided in Section 5.1. Figure 2 outlines the proportions of model performance outcomes across various LRMs, categorized by the correctness of their internal reasoning processes ("Thinking") versus their final output ("Answer") on the intervened datasets.

[Figure 2: stacked outcome proportions per model (R1-8B, R1-14B, R1-32B, Qwen3-4B) on GSM8K-Error and MATH500-Error.] Figure 2. Distribution of reasoning–result alignment across R1 and Qwen3 model variants. The proportions represent four distinct outcomes based on the correctness of the internal "thinking" process versus the correctness of the final "answer" on the GSM8K-Error and MATH500-Error benchmarks. In this paper, we analyze the self-correction mechanism represented by the light blue segments: '× Thinking, ✓ Answer'.

Surprisingly, beyond the dominant cases whose reasoning and final answers align (i.e., either both are correct or both are incorrect), we observe a significant fraction of cases where the model's visible reasoning remains corrupted, yet the final answer is correct.
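The four alignment cases of Figure 2 reduce to a two-bit label per run once the verifier has graded the thinking conclusion and the final answer. A minimal sketch of this bucketing (the grading itself, done by xVerify in the paper, is abstracted into two booleans):

```python
def alignment_case(thinking_correct: bool, answer_correct: bool) -> str:
    """Map a (thinking, answer) correctness pair to one of the four
    Figure 2 buckets. Hidden self-correction is the '× Thinking, ✓ Answer'
    case: the visible CoT stays wrong, but the final answer is right."""
    t = "✓" if thinking_correct else "×"
    a = "✓" if answer_correct else "×"
    return f"{t} Thinking, {a} Answer"


# A run whose injected error corrupts the CoT but whose final answer is right:
print(alignment_case(False, True))  # → × Thinking, ✓ Answer
```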
We hypothesize that in these cases, LRMs implicitly correct the injected mistake without verbalizing it in the CoT. Specifically, on GSM8K-Error, R1 models consistently exhibit a hidden self-correction ability. In 50% to 70% of cases, they produce correct final answers, yet the injected errors remain intact in their reasoning traces. This implicit recovery persists across model scales and on the more challenging MATH500-Error (41–58%), while explicit corrections remain rare (< 6%).

Unlike the R1 models, Qwen3-4B demonstrates powerful explicit backtracking, recovering from the injected error in 37%–41% of GSM8K-Error cases and 78% of MATH500-Error cases. However, there is still a moderate number of samples, 19%–25% on GSM8K-Error and 6%–8% on MATH500-Error, where Qwen3 models return the correct final answer without correcting the mistake in the CoT. This suggests that the hidden self-correction mechanism exists consistently across different model families.

4. The Geometry of Critique Ability

Section 3 shows that LRMs possess a hidden critique ability to recover from injected errors. However, this behavior disappears on the original samples with natural CoTs. In this section, we compare these two cases to extract and analyze the critique ability using mechanistic interpretability tools. We observe linear separability in the latent space between the intervened (incorrect-CoT) and baseline (correct-CoT) runs in Section 4.2. This linear separability suggests that the hidden capability can be decoded as a single direction in the activation space. We validate this hypothesis using Logit Lens (nostalgebraist, 2020) and extract steering vectors that capture the recovery mechanism.

4.1. The Behavior on Original Samples

Section 3 shows that LRMs secretly detect the injected error and recover in the final answer.
We investigate the behavior of LRMs on original samples and measure the accuracy of the thinking and the final answer separately. Table 1 shows that the CoT and the final answer are usually consistent. When the thinking part is wrong, the model hardly ever fixes it and thus returns an incorrect final answer. The recovery rate is only 1.9% for R1-8B on GSM8K and less than 1% in every other case. These results suggest that LRMs behave differently and do not invoke the critique behavior without the injected error.

4.2. The Linear Separability of Internal Thoughts

To investigate further, we compare the latent features of the intervened samples from Section 3 and the original samples. Let c ∼ y_q denote that the answer in c, which could be the CoT or the final answer, aligns with the ground truth y_q. We construct pairs of runs with identical prompts for 1,000 GSM8K training questions Q_train: a baseline run with correct reasoning (c ∼ y_q) and an intervened run with incorrect reasoning (e + c̃ ≁ y_q), both of which yield correct final answers (a ∼ y_q, ã ∼ y_q). As discussed above, in the final answer the model exercises its hidden critique ability to recover from mistakes. Therefore, we inspect the hidden states of the final answers and extract the average feature vectors across answer tokens at each layer ℓ:

h^(ℓ)_baseline(q) = (1/|a|) Σ_{i ∈ a} h^(ℓ)_i(q, c, a),
h^(ℓ)_intervened(q) = (1/|ã|) Σ_{i ∈ ã} h^(ℓ)_i(q, e + c̃, ã),   (4)

where c ∼ y_q, a ∼ y_q, e + c̃ ≁ y_q, ã ∼ y_q. For each layer ℓ, we train a linear probe f_ℓ(h) = W_ℓ h + b_ℓ with W_ℓ ∈ R^{2×d}, where d is the size of the feature vectors, to distinguish the baseline from the intervened activations. The probe is trained with AdamW (learning rate 10^−3) for 100 epochs with a cosine schedule.
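The probing setup can be made concrete with a small numpy sketch. The paper trains a two-way probe with AdamW; the sketch below substitutes plain gradient descent on a logistic probe and uses synthetic Gaussian stand-ins for the layer-ℓ mean activations, so only the shape of the analysis is faithful, not the data or optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                      # stand-in feature size (paper: model hidden size)
h_base = rng.normal(0.0, 1.0, (500, d))     # synthetic baseline-run activations h^(l)
h_intv = rng.normal(0.8, 1.0, (500, d))     # intervened-run activations, shifted along a direction

X = np.vstack([h_base, h_intv])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Logistic probe p = sigmoid(w.h + b), trained with plain gradient descent
# (the paper uses AdamW, lr 1e-3, 100 epochs, cosine schedule).
w, b = np.zeros(d), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

# AUROC via the Mann-Whitney rank formulation
ranks = (X @ w + b).argsort().argsort()
auroc = (ranks[y == 1].sum() - 500 * 499 / 2) / (500 * 500)
print(auroc > 0.95)  # near-perfect separability on this synthetic data
```

With real activations the same recipe is run per layer, and the layer with the best in-distribution AUROC is reported.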
We evaluate on in-distribution (ID) GSM8K-Error and out-of-distribution (OOD) MATH500-Error, reporting layer-wise Area Under the Receiver Operating Characteristic curve (AUROC) and Expected Calibration Error (ECE). In Table 2, we report the linear probing results across models, selecting for each model the layer with the highest ID AUROC and lowest ID ECE. Results for all layers and models are provided in Appendix D.

Table 1. The distribution of the correctness of thinking processes and final answers on the GSM8K and MATH500 datasets.

Cases | GSM8K: Qwen3-4B, R1-8B, R1-14B, R1-32B | MATH500: Qwen3-4B, R1-8B, R1-14B, R1-32B
× Thinking, × Answer | 3.9%, 15.6%, 6.8%, 5.0% | 1.5%, 5.0%, 4.1%, 2.4%
✓ Thinking, ✓ Answer | 95.8%, 81.8%, 92.4%, 94.0% | 98.3%, 95.0%, 95.5%, 97.4%
✓ Thinking, × Answer | 0.0%, 0.7%, 0.2%, 0.5% | 0.2%, 0.0%, 0.0%, 0.0%
× Thinking, ✓ Answer | 0.2%, 1.9%, 0.5%, 0.5% | 0.0%, 0.0%, 0.4%, 0.2%

We observe consistently high probing performance for all models on both datasets, with almost perfect AUROC, indicating that the underlying activations remain linearly separable across all evaluated models despite the matching final answers. This linear separability suggests that the hidden self-correction ability discovered in Section 3 corresponds to a single linear direction in the latent space.

Table 2. Linear probing results: AUROC and ECE evaluated on GSM8K (ID) and MATH500 (OOD) across four different models. Results are reported for the best layer per model; detailed results can be found in Appendix D.

Model | Layer | GSM8K AUROC | GSM8K ECE | MATH500 AUROC | MATH500 ECE
Qwen3-4B | 21 | 1.000 | 0.002 | 1.000 | 0.026
R1-8B | 31 | 0.998 | 0.019 | 0.996 | 0.022
R1-14B | 28 | 1.000 | 0.003 | 0.998 | 0.012
R1-32B | 38 | 1.000 | 0.002 | 0.997 | 0.008

4.3. Extracting the Critique Vector

Following previous works (Turner et al., 2024; Rimsky et al.
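The ECE reported alongside AUROC can be computed with a short binning routine. The sketch below assumes standard equal-width confidence bins over binary probe outputs; the exact binning used in the paper is not specified, so treat this as one common convention rather than the authors' implementation.

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error for a binary probe: bin predictions by
    confidence and average |accuracy - confidence| weighted by bin mass."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    conf = np.maximum(probs, 1 - probs)          # confidence of the predicted class
    pred = (probs >= 0.5).astype(int)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            total += mask.mean() * abs(acc - conf[mask].mean())
    return total

# Confident and always right: |accuracy 1.0 - confidence 0.9| = 0.1
print(ece([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0]))
```

A well-calibrated probe (like those in Table 2) has ECE close to zero: its stated confidence matches its empirical accuracy in every bin.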
, 2024), we identify the critique vector as the difference-in-means between the contrastive pairs of runs constructed in Eqn. 4. We consider the intervened runs, in which the model performs hidden self-correction, as positive examples, and the baseline runs, in which the model merely summarizes the CoT without invoking recovery, as negative examples. Formally, we compute the critique vector at layer ℓ as the mean difference between intervened and baseline activations over the training dataset Q_train:

v_ℓ = (1/|Q_train|) Σ_{q ∈ Q_train} [ h^(ℓ)_intervened(q) − h^(ℓ)_baseline(q) ].   (5)

4.4. Steering with the Critique Vector

We can use the critique vector in Eqn. 5 to causally control and modulate the model's hidden critique/recovery ability. Following the standard steering approach (Venhoff et al., 2025; Sun et al., 2025; Arditi et al., 2024; Chen et al., 2025c; Yang et al., 2025b), we add a scaled steering vector to the residual stream immediately after the MLP layer:

h^(ℓ)_steered = h^(ℓ) + α v_ℓ,   (6)

where h^(ℓ) is the post-MLP residual stream activation at layer ℓ, v_ℓ is the steering vector at layer ℓ, and α ∈ R is a scalar steering coefficient. The modified activations h^(ℓ)_steered are then propagated through subsequent layers, allowing the steering direction to shape the model's downstream generation. Intuitively, this steering vector represents the critique ability of LRMs, which helps them detect and recover from the injected mistakes. Positive values of α amplify the model's self-critique ability, whereas negative values suppress it. Further details on how we select the layer ℓ for steering are provided in Appendix C.

4.5. Logit Lens Interpretation

Table 3. Comparison of the top 10 predicted tokens and their associated logit scores for the Qwen3-4B and R1-32B models.
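Eqns. (5) and (6) amount to a difference-in-means followed by an additive intervention. The numpy sketch below plants a known direction in synthetic activations to show that the difference-in-means recovers it; real usage would read `h_baseline`/`h_intervened` from the model's residual stream instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 1000
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)             # planted "critique" direction

# Stand-ins for per-question mean activations at layer l (Eqn. 4):
h_baseline = rng.normal(size=(n, d))
h_intervened = h_baseline + 2.0 * true_dir + 0.1 * rng.normal(size=(n, d))

# Eqn. 5: difference-in-means critique vector
v = (h_intervened - h_baseline).mean(axis=0)

def steer(h, v, alpha):
    """Eqn. 6: add the scaled critique vector to a post-MLP residual activation."""
    return h + alpha * v

# The recovered vector is almost perfectly aligned with the planted direction:
cos = float(v @ true_dir / np.linalg.norm(v))
print(cos > 0.99)
```

In practice `steer` would be installed as a forward hook on the chosen layer, so every token's activation is shifted by α·v_ℓ before the next layer runs.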
Rank | Qwen3-4B token (logit) | R1-32B token (logit)
1 | wrongly (1.78) | Wait (2.68)
2 | Nope (1.72) | 但实际上 (But actually) (2.65)
3 | 而且还 (And also) (1.63) | 却 (But/Yet) (2.64)
4 | really (1.61) | however (2.60)
5 | that (1.60) | but (2.58)
6 | Actually (1.60) | 不过 (However) (2.57)
7 | 但却 (But yet) (1.59) | Wait (2.55)
8 | ?)\n\n (1.57) | ( (2.49)
9 | oque (1.57) | wait (2.41)
10 | actually (1.56) | However (2.41)

In this section, we investigate the meaning of v_ℓ using Logit Lens (nostalgebraist, 2020), which estimates the model's belief about its internal states. Specifically, we project v_ℓ into the vocabulary space using the unembedding matrix W_U ∈ R^{d×|V|}, where |V| denotes the size of the vocabulary:

logits_ℓ = v_ℓ^⊤ W_U ∈ R^{|V|}.   (7)

We interpret v_ℓ by extracting the tokens with the highest logit values. Table 3 lists the top 10 tokens for Qwen3-4B (layer 29) and R1-32B (layer 57), ranked by their projected logits_ℓ. Notably, some of the highest-ranked tokens include 'Nope' or 'Wait', which typically occur when the model tries to reassess its reasoning and answers. In addition to the English tokens, the steering vectors v_ℓ for both Qwen3-4B and R1-32B are closely linked to several Chinese adversative markers such as '但却' (but yet), '但实际上' (but actually), and '却' (but/yet). This suggests that the identified steering vectors v_ℓ effectively encode the critique behavior. Additional results for multiple layers can be found in Appendix B.

5. Experiments

In this section, we confirm the role of the critique vector found in Section 4.3. As discussed above, the critique ability allows LRMs to detect errors in CoTs and correct these errors in the final answer. We causally intervene on the feature space with the critique vector to evaluate the detection ability in Section 5.2, and we demonstrate the importance of the correction ability for test-time scaling in Section 5.3.
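The logit-lens projection in Eqn. 7 is a single matrix product. The toy numpy sketch below uses a made-up six-token vocabulary and an orthonormal unembedding; the vector is constructed by hand rather than extracted from a model, purely to show the mechanics of ranking tokens by projected logit.

```python
import numpy as np

vocab = ["Wait", "but", "however", "the", "cat", "7"]   # toy vocabulary
d = 8

# Toy unembedding W_U with orthonormal token directions (rows):
W_U = np.zeros((len(vocab), d))
for i in range(len(vocab)):
    W_U[i, i] = 1.0

# A hand-built "critique vector" aligned mostly with "Wait", partly with "but":
v = W_U[0] + 0.5 * W_U[1]

logits = W_U @ v                      # Eqn. 7: project v into vocabulary space
top = [vocab[i] for i in np.argsort(-logits)[:2]]
print(top)  # → ['Wait', 'but']
```

On a real model one replaces `W_U` with the tied or learned unembedding matrix; the top-ranked tokens for the extracted v_ℓ are exactly the adversative markers reported in Table 3.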
[Figure 3: steering-coefficient sweeps for Qwen3-4B, R1-8B, R1-14B, and R1-32B on ProcessBench; panels show Error Accuracy, Correct Accuracy, and F1 Score.] Figure 3. Effect of different steering coefficients α evaluated on ProcessBench. Left to right: error detection accuracy, correct solution accuracy, and F1 score. Increasing α generally improves error detection and F1 scores while leading to a decline in correct solution accuracy. The dashed vertical line indicates the baseline performance at α = 0.

[Figure 4: the same steering-coefficient sweeps evaluated on BIG-Bench Mistake.] Figure 4. Effect of different steering coefficients α evaluated on BIG-Bench Mistake. Left to right: error detection accuracy, correct solution accuracy, and F1 score. The results illustrate similar effects to Figure 3, with a monotonic relationship between α and error detection. The dashed vertical line indicates the baseline performance at α = 0.

5.1. Experimental Details

Datasets. We intervene on two widely used mathematics benchmarks, GSM8K (Cobbe et al., 2021) and MATH500 (Hendrycks et al., 2021), to observe model behavior under injected errors.
The GSM8K training split is used to train linear probes and derive steering vectors. For error-detection evaluation, we use two human-annotated benchmarks: ProcessBench (Zheng et al., 2025) for Olympiad-level mathematics and BIG-Bench-Mistake (Tyen et al., 2024) for diverse logical reasoning tasks. For test-time scaling and self-correction analysis, we use our constructed GSM8K-Error and MATH500-Error datasets alongside BIG-Bench-Mistake, which supports correction from erroneous chains of thought.

Models. We evaluate two open-source reasoning model families: Qwen3 (Yang et al., 2025a) and DeepSeek-R1 (Guo et al., 2025). We test models across different scales: Qwen3-4B, DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and DeepSeek-R1-Distill-Qwen-32B, covering both Llama and Qwen architectures within R1. Throughout this paper, we refer to these models as Qwen3-4B, R1-8B, R1-14B, and R1-32B, respectively.

Evaluation. We use xVerify-0.5B-I to verify both reasoning traces and final answers, as it provides state-of-the-art math evaluation without requiring human annotation and avoids template mismatch (Chen et al., 2025a). For feature analysis, we use greedy decoding (temperature 0.0) to study deterministic internal behavior. For steering experiments, we set the temperature to 0.6 following official guidelines to prevent reasoning loops (Pipis et al., 2025), where the model gets stuck in repetitive thinking patterns, a common issue in reasoning models that we also observed with greedy decoding on challenging tasks. We fix all random seeds to 0 to ensure complete reproducibility across runs. All experiments use two H100 GPUs.

Problem Statement (BIG-Bench Arithmetic)
Expression: (((7 + 2 − 4 ∗ −5) ∗ (4 + −3 ∗ −2 − 4)) ∗ ((4 + −2 ∗ 8 ∗ 3) ∗ (8 + −3 ∗ 1 ∗ 5))) =
Target: 53592
Question: Which calculation is incorrect?
The equation can be written as "(A * B) * (C * D)"... Let's calculate A = (7 + 2 − 4 ∗ −5) = (7 + 2 − (−20)) = (7 + 2 + 20) = 30. Let's calculate B = 6. C = −44. D = −7. Then, the final equation is (30 ∗ 6) ∗ (−44 ∗ −7) = 180 ∗ 308 = 55440.

Baseline Model (Coefficient 0.0)
Calculating A = (7 + 2 − 4 ∗ −5). The reasoning here is to handle multiplication first. So, 4 ∗ −5 is −20. Then, substituting back: 7 + 2 − (−20). Subtracting a negative is like adding, so 7 + 2 + 20 equals 30. That looks correct. ... [Checks B, C, and D] ... Wait, is there any step that might have an error? Let me go through each calculation again. A: 7 + 2 − 4 ∗ −5. 4 ∗ −5 is −20. So, 7 + 2 is 9, minus −20 is 9 + 20 = 30. Correct.
Outcome: The model fails to spot the arithmetic error (9 + 20 = 30).

Positive Steering (Coefficient +1.0)
Calculating A = (7 + 2 − 4 ∗ −5)... 4 ∗ −5 = −20. Substituting back... 7 + 2 + 20 = 30. That looks correct. ... [Checks B, C, and D] ... Wait, let me double-check... Rechecking A: 7 + 2 is 9, minus (−20) is 9 + 20. Wait, hold on! 7 + 2 is 9, and subtracting −20 is adding 20, so 9 + 20 is 29, not 30. Oh! Conclusion: The calculation of A as 30 is wrong.
Outcome: LRM identifies and fixes the error.

Figure 5. Qualitative example of positive steering on the BIG-Bench arithmetic task with R1-32B. A subtle arithmetic error is introduced in the prompt (7 + 2 + 20 = 30, whereas it should be 29). While the baseline model hallucinates that the math is correct during verification, positive steering enables the model to successfully catch the addition error.

5.2. Steering for Error Detection

Setup. We conduct experiments on ProcessBench and BIG-Bench Mistake, which contain pairs of questions and corresponding CoT-style traces. Each CoT contains either one incorrect thought or none. We prompt LRMs to return the index of the incorrect thought, or −1 if there is no mistake (details in Appendix A.2).
We compare LRMs' answers with the ground-truth labels and report the accuracy on CoTs with a mistake (Error Accuracy), CoTs without a mistake (Correct Accuracy), and the F1 score on all samples. We report these metrics consistently with the original implementations of these two datasets.

Problem Statement (BIG-Bench Arithmetic)
Expression: ((((−6 − −3) − (−4 + 4)) ∗ ((7 ∗ 0) − (0 − −7))) − (((5 + 8) ∗ (9 + 6)) − ((5 ∗ −4) ∗ (2 − −1)))) =
Target: -234
Question: Which calculation is incorrect?
The equation can be written as "A * B - C"... Let's calculate A = -3. Let's calculate B = ((7 ∗ 0) − (0 − −7)) = ((0) − (−7)) = (0 + 7) = 7. Let's calculate C = 255. Then, the final equation is (−3 ∗ 7) − 255 = −276.

Baseline Model (Coefficient 0.0)
The expression for B is ((7 ∗ 0) − (0 − −7)). First, 7 ∗ 0 = 0. Then, 0 − (−7) = 7. So, it becomes 0 − 7, which is −7. In Thought 2, they computed it as 7. Therefore, Thought 2 is incorrect.
Conclusion: The error is in Thought 2.
Outcome: The model correctly identifies the arithmetic error (B = −7).

Negative Steering (Coefficient -1.0)
Calculating B = ((7 ∗ 0) − (0 − −7)). So that's 0 − (−7), which is 0 + 7 = 7. So B is 7. That seems correct. Wait, let me double-check... 0 − (−7) is adding 7. So B is 7.
Conclusion: All steps seem to check out. No incorrect step found.
Outcome: LRM fails to spot the error.

Figure 6. Qualitative example of negative steering on the BIG-Bench arithmetic task with R1-32B. The problem contains a sign error in the calculation of term B (0 − (0 − −7) = −7, not 7). While the baseline model correctly critiques the reasoning and identifies the error, negative steering (coefficient -1.0) makes the model hallucinate that the incorrect calculation is valid.

Critique vector controls the detection ability. We steer LRMs with the critique vector using multiple scales ranging from −1.0 to 1.0 (Figures 3 and 4).
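The three metrics above can be sketched as follows. The exact F1 convention follows the benchmarks' official implementations; this sketch makes one plausible assumption, treating "the model flags some step" as the positive class.

```python
def detection_metrics(preds, golds):
    """Error Accuracy / Correct Accuracy / F1 for step-level error detection.
    preds, golds: step indices, with -1 meaning 'no mistake in this CoT'."""
    with_err = [(p, g) for p, g in zip(preds, golds) if g != -1]
    no_err = [(p, g) for p, g in zip(preds, golds) if g == -1]
    error_acc = sum(p == g for p, g in with_err) / max(len(with_err), 1)
    correct_acc = sum(p == g for p, g in no_err) / max(len(no_err), 1)
    tp = sum(p == g for p, g in with_err)   # right step found
    fp = sum(p != -1 for p, _ in no_err)    # flagged a step in a clean CoT
    fn = len(with_err) - tp                 # missed or misplaced the error
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return error_acc, correct_acc, f1

# One trace with the error found, one clean trace, two with the error missed.
e, c, f = detection_metrics([2, -1, 3, -1], [2, -1, 1, 2])
```

Error Accuracy and Correct Accuracy deliberately condition on disjoint subsets, so a model cannot trade one for the other without the F1 score reflecting it.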
Overall, there is a positive correlation between the steering coefficients and the detection performance, confirming that the critique vector causally controls error detection. Additionally, Qwen3-4B yields better performance than the R1-distilled models finetuned from Qwen2.5 and Llama3. On ProcessBench, Figure 3 shows that positive steering with coefficient 1.0 improves the error detection accuracy by 5% on all models. In contrast, negative steering consistently decreases these figures. Similarly, Figure 4 illustrates the effectiveness of the critique vector on BIG-Bench Mistake, raising the error accuracy of Qwen3-4B by 4%: from 52% to 56%. Qualitative examples are shown in Figures 5 and 6.

We notice that in some scenarios, positive steering could harm the correct accuracy, as with Qwen3-4B. We provide qualitative samples in Appendix ??. We hypothesize that this intervention forces the model to revise the solution, potentially misreading correct thoughts and inadvertently flagging them as incorrect. However, the drop in correct accuracy is insignificant, and thus we can still observe the positive correlation between F1 score and steering coefficients.

[Figure 7: line plots of accuracy versus test-time scaling iteration (0–5) for Qwen3-4B, R1-8B, R1-14B, and R1-32B, comparing Baseline, Positive Steering, Negative Steering, and No Wait.]

Figure 7. Impact of steering on model accuracy for test-time scaling. Left to right: accuracy across three benchmarks: GSM8K-Error, MATH500-Error, and BIG-Bench Mistake.
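The intervention applied at each coefficient can be sketched as a simple shift of the residual stream. In a real pipeline this would typically be registered as a forward hook on one transformer layer; the function and variable names here are illustrative assumptions.

```python
import numpy as np

def steer(hidden_states: np.ndarray, v: np.ndarray, coefficient: float) -> np.ndarray:
    """Shift every token's hidden state along the (unit-norm) critique
    direction v; coefficient > 0 strengthens critique, < 0 suppresses it."""
    return hidden_states + coefficient * v

hidden = np.zeros((4, 3))          # (seq_len, d_model) toy activations
v = np.array([1.0, 0.0, 0.0])      # unit-norm steering direction
steered_pos = steer(hidden, v, 1.0)
steered_neg = steer(hidden, v, -1.0)
```

Because the direction is unit-norm, the coefficient directly measures how far each token's representation moves along the critique axis, which is what the −1.0 to 1.0 sweep varies.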
For increasing iterations of test-time scaling, positive steering consistently enhances the performance of all models.

5.3. Test-Time Scaling with Critique Steering

Setup. We evaluate the controllability of error recovery performance on GSM8K-Error, MATH500-Error, and BIG-Bench Mistake. For GSM8K-Error and MATH500-Error, we let the model generate from the injected error until it finishes the final answer. For BIG-Bench Mistake, we put the given CoT in the thinking part and let the model produce the final answer. More details of the test-time scaling instruction prompt can be found in Appendix A.3. We compare the final answer with the ground-truth label to compute the accuracy. To check whether test-time scaling helps LRMs fix the injected errors, we adopt the budget forcing approach (Muennighoff et al., 2025) and append the "Wait" token when the model wants to terminate the thinking process. We report the performance at different numbers of iterations corresponding to the number of appended "Wait" tokens. For the effect of the critique vector, we perform positive steering with a coefficient of 1.0 and negative steering with a coefficient of −1.0.

Critique vector steers the recovery performance. Figure 7 depicts that steering with the critique vector successfully controls the ability to recover from injected errors. In particular, the original Qwen3-4B yields 77% accuracy on GSM8K-Error, while positive steering boosts that value to almost 90%. In contrast, negative steering degrades the performance to slightly above 50%. Similar trends can be observed on the other models (the R1-distilled series) and datasets (MATH500-Error and BIG-Bench Mistake). This confirms that the critique ability plays a vital role in error correction and is represented by our identified critique vector.

Test-time scaling corrects the injected mistake.
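The budget-forcing loop can be sketched as follows. `generate` is a stand-in for the model's decode step and the `</think>` terminator is an assumption about the chat template; the idea is simply to intercept each attempt to close the thinking block and append "Wait".

```python
def budget_forcing(generate, prompt: str, iterations: int, stop: str = "</think>") -> str:
    """Budget-forcing sketch: each time the model tries to close its thinking
    block, strip the terminator and append 'Wait' to force another round of
    reflection; after the allotted iterations, let the model finish normally."""
    text = prompt
    for _ in range(iterations):
        text = generate(text)
        if stop not in text:
            break                      # model kept thinking on its own
        text = text.split(stop)[0] + " Wait"
    return generate(text)

# Stand-in decoder: always 'thinks' one step and then tries to stop.
fake_generate = lambda t: t + " step</think>"
out = budget_forcing(fake_generate, "<think>", iterations=2)
```

The iteration count in Figure 7's x-axis corresponds to `iterations` here, i.e., the number of "Wait" tokens forced into the trace.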
Besides the hidden critique ability, we expect that explicit reflection induced by manually appending "Wait" can also recover from injected errors. Indeed, Figure 7 shows that more budget forcing reliably improves the accuracy on the three datasets across all models. This observation aligns with previous studies (Muennighoff et al., 2025; Snell et al., 2025) exhibiting the effectiveness of test-time scaling.

Critique vector further enhances test-time scaling. However, test-time scaling alone does not reliably repair the incorrect CoT. We observe that positive steering consistently boosts the accuracy of test-time scaling at all forcing iterations. These results highlight the impact of the hidden critique ability, suggesting that it is complementary to and cannot be completely captured by explicit reflection. On the other hand, negative steering also reduces the accuracy at all iterations, indicating that test-time scaling might not be successful without the critique ability.

6. Conclusion

In this paper, we discover the novel hidden critique behavior, where LRMs recover from injected mistakes in CoTs and produce correct final answers but do not verbalize the error-detection and error-correction process. Through feature analysis, we show that LRMs internally distinguish the latent features of intervened samples with hidden critique behavior from those of original samples without it. We identify the critique vector representing this hidden critique mechanism and validate it through logit lens analysis. Our empirical evidence shows that steering with this vector successfully controls the error detection performance on several benchmarks across many models. In addition, we point out that test-time scaling relies on the critique ability and can be combined with critique steering to further enhance the correction performance.
Our findings shed light on the role of the critique ability in current LRMs and open many interesting questions. Future work should investigate whether this behavior results from pre-training or post-training and how it emerges. Since the hidden critique ability lets LRMs correct errors in the final answer without verbalizing them in CoTs, its impact on the faithfulness of CoTs can be analyzed. Furthermore, it is interesting to study the influence of this ability on other properties, for example, alignment. Finally, this ability can be leveraged to improve other test-time scaling methods, such as verifier-based approaches, self-consistency, or self-correction prompting. We leave these to future work.

Impact Statement

Error-correction and error-detection are beneficial to guarantee the quality of responses provided by language models, which is important for high-stakes applications such as medical domains or scientific discovery. Our analysis reveals the mechanism of this behavior, and our proposed methods allow controlling error-recovery performance, advancing the development of reliable and trustworthy AI systems.

References

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A. Chain-of-thought reasoning in the wild is not always faithful, 2025. URL https://arxiv.org/abs/2503.08679.

Arditi, A., Obeso, O. B., Syed, A., Paleka, D., Rimsky, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c.

Chen, D., Yu, Q., Wang, P., Hu, M., Zhang, W., Wang, Z., Tang, B., Xiong, F., Li, X., Wang, C., Yang, M., and Li, Z. xVerify: Efficient answer verifier for reasoning model evaluations, 2025a. URL https://arxiv.org/abs/2504.10481.
Chen, J., Ren, J., Chen, X., Yang, C., Sun, R., Yoon, J., and Arik, S. O. SETS: Leveraging self-verification and self-correction for improved test-time scaling. Transactions on Machine Learning Research, 2025b. ISSN 2835-8856. URL https://openreview.net/forum?id=Wv9NMJoKww.

Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. SEAL: Steerable reasoning calibration of large language models for free. In Second Conference on Language Modeling, 2025c. URL https://openreview.net/forum?id=klPszYDIRT.

Chen, X., Xu, J., Liang, T., He, Z., Pang, J., Yu, D., Song, L., Liu, Q., Zhou, M., Zhang, Z., Wang, R., Tu, Z., Mi, H., and Yu, D. Do NOT think that much for 2+3=? On the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025d. URL https://openreview.net/forum?id=MSbU3L7V00.

Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S. R., Leike, J., Kaplan, J., and Perez, E. Reasoning models don't always say what they think, 2025e. URL https://arxiv.org/abs/2505.05410.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.

Fan, C., Li, M., Sun, L., and Zhou, T. Missing premise exacerbates overthinking: Are reasoning models losing critical thinking skill? In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=ufozo2Wc9e.

Ghosal, S. S., Chakraborty, S., Reddy, A., Lu, Y., Wang, M., Manocha, D., Huang, F., Ghavamzadeh, M., and Bedi, A. S. Does thinking more always help? Mirage of test-time scaling in reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=tKPqbamNb9.
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z.

He, Y., Li, S., Liu, J., Wang, W., Bu, X., Zhang, G., Peng, Z., Zhang, Z., Zheng, Z., Su, W., and Zheng, B. Can large language models detect errors in long chain-of-thought reasoning? In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18468–18489, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.905. URL https://aclanthology.org/2025.acl-long.905/.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset, 2021.

Kamoi, R., Zhang, Y., Zhang, N., Han, J., and Zhang, R. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12:1417–1440, 2024. doi: 10.1162/tacl_a_00713. URL https://aclanthology.org/2024.tacl-1.78/.

Kang, M., Jeong, J., and Cho, J. T1: Tool-integrated self-verification for test-time compute scaling in small language models, 2025. URL https://arxiv.org/abs/2504.04718.

Konen, K., Jentzsch, S., Diallo, D., Schütt, P., Bensch, O., El Baff, R., Opitz, D., and Hecking, T. Style vectors for steering generative large language models. In Graham, Y. and Purver, M. (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp. 782–802, St. Julian's, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.52/.
Kumar, A., Zhuang, V., Agarwal, R., Su, Y., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., Zhang, L. M., McKinney, K., Shrivastava, D., Paduraru, C., Tucker, G., Precup, D., Behbahani, F., and Faust, A. Training language models to self-correct via reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=CjwERcAU7w.

Lee, A., Sun, L., Wendler, C., Viégas, F., and Wattenberg, M. The geometry of self-verification in a task-specific reasoning model, 2025a. URL https://arxiv.org/abs/2504.14379.

Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling, E., Dognin, P., Nagireddy, M., and Dhurandhar, A. Programming refusal with conditional activation steering, 2025b.

Lee, H., Oh, S., Kim, J., Shin, J., and Tack, J. ReVISE: Learning to refine at test-time via intrinsic self-verification. In Forty-second International Conference on Machine Learning, 2025c. URL https://openreview.net/forum?id=cBtsxtJqEK.

Li, J., Zhao, W., Zhang, Y., and Gan, C. Steering LLM thinking with budget guidance, 2025. URL https://arxiv.org/abs/2506.13752.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi.

Liu, R., Gao, J., Zhao, J., Zhang, K., Li, X., Qi, B., Ouyang, W., and Zhou, B. Can 1B LLM surpass 405B LLM? Rethinking compute-optimal test-time scaling, 2025. URL https://arxiv.org/abs/2502.06703.

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candes, E., and Hashimoto, T. s1: Simple test-time scaling. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V.
(eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20275–20321, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1025. URL https://aclanthology.org/2025.emnlp-main.1025/.

nostalgebraist. Interpreting GPT: The logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020. LessWrong.

OpenAI: Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., Iftimie, A., Karpenko, A., Passos, A. T., Neitz, A., Prokofiev, A., Wei, A., et al. OpenAI o1 system card, 2024. URL https://arxiv.org/abs/2412.16720.

Pipis, C., Garg, S., Kontonis, V., Shrivastava, V., Krishnamurthy, A., and Papailiopoulos, D. Wait, wait, wait...
why do reasoning models loop?, 2025. URL https://arxiv.org/abs/2512.12895.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering Llama 2 via contrastive activation addition. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.828. URL https://aclanthology.org/2024.acl-long.828/.

Setlur, A., Rajaraman, N., Levine, S., and Kumar, A. Scaling test-time compute without verification or RL is suboptimal. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=beeNgQEfe2.

Snell, C. V., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n.

Stolfo, A., Balachandran, V., Yousefi, S., Horvitz, E., and Nushi, B. Improving instruction-following in language models through activation steering, 2025. URL https://arxiv.org/abs/2410.12877.

Sun, C.-E., Yan, G., and Weng, T.-W. ThinkEdit: Interpretable weight editing to mitigate overly short thinking in reasoning models. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17001–17025, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.861. URL https://aclanthology.org/2025.emnlp-main.861/.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering, 2024.
URL https://arxiv.org/abs/2308.10248.

Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi.

Tyen, G., Mansoor, H., Carbune, V., Chen, P., and Mak, T. LLMs cannot find reasoning errors, but can correct them given the error location. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 13894–13908, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.826. URL https://aclanthology.org/2024.findings-acl.826/.

Venhoff, C., Arcuschin, I., Torr, P., Conmy, A., and Nanda, N. Understanding reasoning in thinking language models via steering vectors. In Workshop on Reasoning and Planning for Large Language Models, 2025. URL https://openreview.net/forum?id=OwhVWNOBcz.

Wang, P., Li, L., Shao, Z., Xu, R., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439, 2024.

Weng, Y., Zhu, M., Xia, F., Li, B., He, S., Liu, S., Sun, B., Liu, K., and Zhao, J. Large language models are better reasoners with self-verification. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=s4xIeYimGQ.

Wu, T., Xiang, C., Wang, J. T., Suh, G. E., and Mittal, P. Effectively controlling reasoning models through thinking intervention, 2025. URL https://arxiv.org/abs/2503.24370.

Wu, Z., Zeng, Q., Zhang, Z., Tan, Z., Shen, C., and Jiang, M.
Large language models can self-correct with key condition verification. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 12846–12867, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.714. URL https://aclanthology.org/2024.emnlp-main.714/.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y., Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. Qwen3 technical report, 2025a. URL https://arxiv.org/abs/2505.09388.

Yang, J., Li, R., Wang, W., Zhou, Z., Feng, Z., and Peng, W. LF-Steering: Latent feature activation steering for enhancing semantic consistency in large language models, 2025b.

Yang, S., Lee, S.-W., Kassner, N., Gottesman, D., Riedel, S., and Geva, M. How well can reasoning models identify and recover from unhelpful thoughts? In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 7030–7047, Suzhou, China, November 2025c. Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.370. URL https://aclanthology.org/2025.findings-emnlp.370/.

Yang, S., Wu, J., Chen, X., Xiao, Y., Yang, X., Wong, D. F., and Wang, D. Understanding aha moments: From external observations to internal mechanisms, 2025d. URL https://arxiv.org/abs/2504.02956.

Yang, Z., Zhang, Y., Wang, Y., Xu, Z., Lin, J., and Sui, Z.
Confidence vs. critique: A decomposition of self-correction capability for LLMs. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3998–4014, Vienna, Austria, July 2025e. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.203. URL https://aclanthology.org/2025.acl-long.203/.

Zeng, Z., Cheng, Q., Yin, Z., Zhou, Y., and Qiu, X. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4651–4665, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.232. URL https://aclanthology.org/2025.acl-long.232/.

Zhang, A., Chen, Y., Pan, J., Zhao, C., Panda, A., Li, J., and He, H. Reasoning models know when they're right: Probing hidden states for self-verification. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=O6I0Av7683.

Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., and Tang, J. ReST-MCTS*: LLM self-training via process reward guided tree search. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 64735–64772. Curran Associates, Inc., 2024. doi: 10.52202/079017-2066.

Zhao, J., Sun, Y., Shi, W., and Song, D. Can aha moments be fake? Identifying true and decorative thinking steps in chain-of-thought, 2025. URL https://arxiv.org/abs/2510.24941.

Zheng, C., Zhang, Z., Zhang, B., Lin, R., Lu, K., Yu, B., Liu, D., Zhou, J., and Lin, J. ProcessBench: Identifying process errors in mathematical reasoning.
In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1009–1024, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.50. URL https://aclanthology.org/2025.acl-long.50/.

A. System Prompt

A.1. GPT-5 Instruction Prompt

The following system prompt P_err was used to generate incorrect reasoning steps:

GPT-5 Instruction Prompt

Given a question and its ground-truth answer, generate a natural reasoning trace that contains a calculation error.

PURPOSE: This trace is a deliberate error variant used to test a reasoning model's ability to recover from arithmetic mistakes.

INSTRUCTIONS:
- Write the trace as a sequence of short, natural reasoning steps, phrased as if someone is thinking aloud.
- Use natural, anthropomorphic openings such as "First, I need to...", "Okay, so...", "Let me...", or other common starter phrases typically found in the thinking process of reasoning models.
- Separate steps with two newlines (\n\n).
- Do NOT mention or hint that an error occurred (no phrases like "mistake", "should be", or "wrong").
- Introduce an arithmetic error at the earliest plausible calculation step.
- Stop the reasoning immediately after the mistaken calculation step; do not continue solving or produce an answer. The goal is to isolate the reasoning up to the error only.
- The trace must be self-consistent and match the question up to the step before the error.
- Keep the reasoning coherent, natural, and plausible.

EXAMPLE:
Question: Olivia buys 3 packs of markers, each pack containing 12 markers. She gives 7 markers to her friend. How many markers does Olivia have left?
Ground-truth answer: 29
Trace: "Okay, so I need to figure out how many markers Olivia got in total. She bought 3 packs, each with 12 markers.\n\nNext, I'll multiply 3 × 12 = 38."

Return only the interventions using the configured JSON schema.
Question: {question}
Ground-truth answer: {ground truth}

A.2. Error Detection Prompt

We evaluate reasoning error detection using two benchmarks: ProcessBench and BIG-Bench-Mistake, each with its own prompting format. For ProcessBench, we follow the original critique-based prompting scheme exactly as implemented in the official repository.¹ The model is provided with a math problem and a paragraph-structured solution trace, and is instructed to identify the earliest paragraph containing an error, or return −1 if no error is present. For BIG-Bench-Mistake, we follow the original task-specific templates provided by the authors.² We observe that strong reasoning models often solve the problem directly rather than performing error detection. To maintain the original task structure while enforcing focused evaluation, we prepend a minimal instruction requiring the model to identify incorrect steps only.

¹ https://github.com/QwenLM/ProcessBench
² https://github.com/WHGTyen/BIG-Bench-Mistake

ProcessBench Error Detection Prompt

The following is a math problem and a solution (split into paragraphs, enclosed with tags and indexed from 0):
[Math Problem]
{problem}
[Solution]
{tagged response}
Your task is to review and critique the solution paragraph by paragraph. Once you identify an error in a paragraph, return the index of the paragraph where the earliest error occurs. Otherwise, return −1. Please put your final answer in \boxed{}.

BIG-Bench-Mistake Task Template

Given a problem and its solution steps below, identify whether there is any incorrect reasoning step. Do NOT solve the problem.
Task: {task description}
Problem: {input}
Reasoning: {steps}
Return \boxed{k} where k is the step number of the incorrect step, or \boxed{-1} if all steps are correct.

A.3. Test-time Scaling Prompt

We perform test-time scaling by injecting a corrupted chain-of-thought (CoT) into the model's intermediate thinking process and allowing the model to continue generating. This setup simulates partially erroneous reasoning and evaluates whether extended inference can recover from the injected mistake. Concretely, each generation follows the structure:

Test-time Scaling Generation Format
{question} ⟨think⟩ {injected erroneous CoT} {continued reasoning} ⟨/think⟩ {final answer}

After the model completes its reasoning, we optionally append the token "Wait" multiple times to extend inference depth. This simple control mechanism encourages backtracking and revision of earlier steps, following prior work (e.g., Muennighoff et al., 2025; Zeng et al., 2025; Ghosal et al., 2025).

B. Additional Results of Logit Lens Interpretation

To validate the robustness of the identified steering vectors v_ℓ, we analyze the trajectory of the top projected tokens with Logit Lens across consecutive layers. Tables 4 and 5 present the evolution of the top 10 tokens and their associated logit scores for Qwen3-4B (Layers 29–33) and R1-32B (Layers 55–58), respectively. Comparing the two models reveals some qualitative differences in their self-critique mechanisms. While R1-32B (Table 5) tends toward interruptive tokens such as 'Wait' or '但实际上' (but actually), Qwen3-4B (Table 4) exhibits a preference for direct negation and correction tokens, such as 'wrongly', 'Nope', and 'Actually', throughout Layers 29–33.
We hypothesize that this distinction may suggest that the extracted vector in Qwen3-4B encodes a more immediate "fact-checking" signal, whereas the vector in R1-32B captures a specialized backtracking behavior. This may also explain the significant difference between the two models in the proportions of '× Think ✓ Answer' and '✓ Think ✓ Answer' in Figure 2.

Table 4. Evolution of the top 10 predicted tokens and logit scores across five layers (Layer 29 to Layer 33) for the Qwen3-4B model.

| Rank | Layer 29 | Layer 30 | Layer 31 | Layer 32 | Layer 33 |
|---|---|---|---|---|---|
| 1 | wrongly (1.78) | 而且还 [And also] (2.28) | really (2.49) | wrongly (2.88) | !)\n (3.10) |
| 2 | Nope (1.72) | Nope (2.21) | wrongly (2.45) | !)\n (2.69) | wrongly (2.91) |
| 3 | 而且还 [And also] (1.63) | wrongly (2.17) | 而且还 [And also] (2.42) | —but (2.65) | !)\n\n (2.80) |
| 4 | really (1.61) | really (2.11) | Nope (2.38) | 而且还 [And also] (2.63) | )!\n (2.79) |
| 5 | that (1.60) | !)\n (2.02) | Actually (2.22) | rethink (2.62) | !!\n\n (2.77) |
| 6 | Actually (1.60) | 只不过 [Just] (2.01) | !)\n (2.18) | !)\n\n (2.62) | !!\n (2.75) |
| 7 | 但却 [But yet] (1.59) | Actually (2.00) | 只不过 [Just] (2.18) | ?!\n\n (2.55) | !!!\n (2.74) |
| 8 | ?)\n\n (1.57) | !!!\n (1.99) | actually (2.17) | !!!\n (2.52) | really (2.74) |
| 9 | oque (1.57) | !)\n\n (1.96) | !!!\n (2.15) | !!\n\n (2.51) | —but (2.74) |
| 10 | actually (1.56) | choc (1.94) | somebody (2.15) | —we (2.49) | actually (2.74) |

Table 5. Evolution of the top 10 predicted tokens and logit scores across four layers (Layer 55 to Layer 58) for the R1-32B model.
| Rank | Layer 55 | Layer 56 | Layer 57 | Layer 58 |
|---|---|---|---|---|
| 1 | 但实际上 [But actually] (2.83) | 但实际上 [But actually] (2.66) | Wait (2.68) | Wait (2.95) |
| 2 | but (2.54) | Wait (2.60) | 但实际上 [But actually] (2.65) | Wait (2.78) |
| 3 | Wait (2.43) | Wait (2.46) | 却 [But/Yet] (2.64) | however (2.78) |
| 4 | 但由于 [But due to] (2.42) | however (2.46) | however (2.60) | ( (2.74) |
| 5 | 但现在 [But now] (2.38) | 却 [But/Yet] (2.45) | but (2.58) | 却 [But/Yet] (2.70) |
| 6 | 却被 [But was] (2.37) | but (2.36) | 不过 [However] (2.57) | but (2.68) |
| 7 | 却 [But/Yet] (2.30) | wait (2.31) | Wait (2.55) | wait (2.65) |
| 8 | 但从 [But from] (2.28) | 但由于 [But due to] (2.30) | ( (2.49) | 但实际上 [But actually] (2.62) |
| 9 | Wait (2.18) | 却被 [But was] (2.28) | wait (2.41) | 不过 [However] (2.58) |
| 10 | 不过 [However] (2.18) | However (2.28) | However (2.41) | However (2.50) |

C. How to Select the Optimal Layer for Steering?

Following previous works (Zhao et al., 2025; Chen et al., 2025c; Venhoff et al., 2025; Sun et al., 2025), we apply the steering vector to the LRM at only one selected layer. In this section, we describe the procedure for identifying the optimal layer, i.e., the layer most representative of the self-critique mechanism. Consistent with Sun et al. (2025), for each LRM, we apply the steering vector as defined in Eq. 6 to each layer independently using scaling factors α ∈ {−1.0, 1.0}. All experiments for layer-wise steering are conducted on the GSM8K training split (Figure 8). We define the optimal layer as the one that yields the greatest accuracy differential between positive (1.0) and negative (−1.0) steering. At this layer, the effect of self-correction enhancement or suppression is the most pronounced, suggesting that it acts as the primary locus for the model's internal self-critique processes.
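This selection rule can be sketched in a few lines of code. The sketch below is illustrative, not the paper's released implementation: in practice, each accuracy entry would come from running the steered model on the GSM8K training split, which is mocked here with toy values shaped like Table 6.

```python
# Sketch of the layer-selection rule from Appendix C: pick the layer with the
# largest accuracy gap (delta) between positive (alpha = +1.0) and
# negative (alpha = -1.0) steering.

def select_optimal_layer(accuracies):
    """accuracies maps layer -> {alpha: accuracy in %}."""
    def delta(layer):
        return accuracies[layer][1.0] - accuracies[layer][-1.0]
    return max(accuracies, key=delta)

# Toy numbers around Qwen3-4B's reported optimum (Layer 21, delta = 38.72).
acc = {
    20: {1.0: 55.0, -1.0: 35.0},
    21: {1.0: 63.72, -1.0: 25.00},
    22: {1.0: 58.0, -1.0: 33.0},
}
print(select_optimal_layer(acc))  # -> 21
```

The steering operation itself (adding α·v_ℓ to the hidden state at layer ℓ, per Eq. 6) is applied inside the model's forward pass and is omitted from this sketch.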
As demonstrated in Table 6 and Figure 8, the differential between positive and negative steering (Δ) is most prominent for middle layers, with Δ even exceeding 38% for Layer 21 of Qwen3-4B.

[Figure 8: Layer-wise steering results for R1-8B, R1-14B, R1-32B, and Qwen3-4B, plotting accuracy (%) against layer ℓ for positive (1.0) steering, negative (-1.0) steering, and the baseline.]
Figure 8. Layer-wise steering results. The separation between the performance of positive and negative steering is minimal for early and late layers. For middle layers, this performance gap is the most noticeable.

Table 6. Results of the optimal layer. Overall accuracy (%) of the optimal layer evaluated on the training split of GSM8K for the baseline (α = 0.0), positive steering (α = 1.0), negative steering (α = −1.0), and the difference between positive and negative steering (Δ).

| Model | α = 0.0 | α = −1.0 | α = 1.0 | Δ | Layer |
|---|---|---|---|---|---|
| Qwen3-4B | 42.36 | 25.00 | 63.72 | 38.72 | 21 |
| R1-8B | 57.03 | 53.36 | 73.48 | 20.13 | 13 |
| R1-14B | 80.99 | 73.71 | 85.64 | 11.93 | 28 |
| R1-32B | 79.42 | 69.74 | 86.25 | 16.51 | 44 |

D. Additional Results of Linear Separability of Internal Thoughts

In Section 4.2, we provided the results for the best layer in the linear probing experiments. In this section, we report the results (AUROC and ECE) for all layers across four LRMs: Qwen3-4B, R1-8B, R1-14B, and R1-32B. As shown in the AUROC plots (left column), linear separability saturates near 1.0 for the majority of the network depth across all models. The GSM8K dataset (in-distribution, orange) consistently exhibits high and stable separability almost immediately from the initial layers. In addition, we observe that lower layers tend to be poorly calibrated (high ECE), with calibration improving significantly in the middle layers.
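For reference, the two metrics plotted in this section can be computed from probe outputs as follows. This is a minimal sketch under standard definitions; the probe itself (a linear classifier on hidden states) is assumed and not shown.

```python
# AUROC and Expected Calibration Error (ECE) for binary probe outputs.

def auroc(scores, labels):
    """AUROC = P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(probs, labels, n_bins=10):
    """ECE: |accuracy - mean confidence| averaged over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(acc - conf)
    return err

print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # -> 1.0 (perfectly separable)
```

A probe can thus reach AUROC near 1.0 (perfect ranking) while still having high ECE if its confidence values are miscalibrated, which is exactly the pattern observed in the lower layers.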
Comparing the two datasets, GSM8K demonstrates consistently lower Expected Calibration Error and a smoother layer-wise progression than MATH500, which exhibits considerably higher variance and erratic spikes in ECE. However, for most layers, the performance of the linear probe on GSM8K and MATH500 remains close, suggesting that this separability generalizes well.

[Figure 9: Layer-wise AUROC (left) and ECE (right) curves across all layers for Qwen3-4B, R1-8B, R1-14B, and R1-32B on GSM8K and MATH500.]
Figure 9. Layer-wise linear probing results. Left to right: AUROC and ECE for GSM8K (in-distribution) and MATH500 (out-of-distribution). For both settings, the linear probe achieves near-perfect AUROC and low ECE at several layers.

E. Qualitative Examples

E.1. Full Examples of Original and Intervened Generation

We provide the complete responses of the R1-32B model shown in Figure 1, drawn from the GSM8K dataset (Figure 10).

Response Comparison - R1-32B

Original Response
Prompt: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
⟨think⟩ First, I need to determine the total number of eggs Janet's ducks lay each day. According to the problem, the ducks lay 16 eggs per day. Next, I'll calculate how many eggs Janet uses each day.
She eats 3 eggs for breakfast and uses 4 eggs for baking muffins, totaling 7 eggs consumed daily. To find out how many eggs are left to sell, I'll subtract the eggs used from the total eggs laid: 16 eggs minus 7 eggs equals 9 eggs available for sale. Finally, I'll calculate Janet's daily earnings from selling the eggs. Each egg is sold for $2, so multiplying the number of eggs sold by the price per egg gives 9 eggs multiplied by $2, which equals $18 per day. ⟨/think⟩
Final Answer: Let's break down the problem step by step to find out how much Janet makes every day at the farmers' market.
1. Total Eggs Laid Per Day: 16 eggs/day
2. Eggs Consumed Per Day:
• For Breakfast: 3 eggs/day
• For Baking Muffins: 4 eggs/day
Total Eggs Consumed: 3 + 4 = 7 eggs/day
3. Eggs Available for Sale: 16 − 7 = 9 eggs/day
4. Earnings from Selling Eggs: 9 eggs × $2/egg = $18/day
Final Answer: 18

Intervened Response
Prompt: Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
⟨think⟩ First, I want to find how many eggs Janet uses before selling any. She eats 3 each morning and uses 4 more for muffins. So she uses 3 + 4 = 7 eggs per day. Next, I'll subtract the eggs she uses from the total she collects. She collects 16 eggs each day. So, 16 − 6 = 10 eggs are left to sell. Finally, I'll calculate her earnings by multiplying the number of eggs she sells by the price per egg. She sells 10 eggs at $2 each. So, 10 × $2 = $20. Therefore, Janet makes $20 every day at the farmers' market. ⟨/think⟩
Final Answer: Let's break down the problem step by step to find out how much Janet makes every day at the farmers' market.
1. Total Eggs Laid Per Day: Janet's ducks lay a total of: 16 eggs/day
2.
Eggs Used by Janet:
• For Breakfast: Janet eats 3 eggs every morning.
• For Muffins: She uses 4 eggs daily.
Total Eggs Used: 3 + 4 = 7 eggs/day
3. Eggs Available for Sale: 16 − 7 = 9 eggs/day
4. Earnings from Selling Eggs: 9 eggs × $2/egg = $18
Final Answer: 18

Figure 10. Complete example shown in Figure 1, generated by DeepSeek-R1-Distill-Qwen-32B (R1-32B). As laid out in Section 3.1, we generate the arithmetic error with GPT-5 (highlighted by a box in the right column) and insert this error immediately after the ⟨think⟩ token, which causes the model to make subsequent mistakes (highlighted in red) in the thinking stage. However, for the final answer, the model is able to recover from those thinking mistakes and derive the correct final solution. In this example, correct thinking steps and solutions are highlighted in green.

E.2. Effects of Positive and Negative Steering on LRMs

In this section, we present a qualitative analysis of the R1-32B model's reasoning behaviors under the influence of steering vectors derived from the critique mechanism. We examine three distinct configurations: the Baseline Model (Coefficient 0.0), Positive Steering (Coefficient +1.0), and Negative Steering (Coefficient -1.0). The examples, selected from GSM8K samples of ProcessBench and GSM8K-Error, illustrate how steering modulates the model's capacity for error detection, semantic interpretation, and arithmetic verification. For this section, an ellipsis ("...") is used to truncate model responses for brevity and clarity.

E.2.1. Error Detection

We first examine how steering affects error detection capability, drawing on examples from ProcessBench-GSM8K and GSM8K-Error. These cases reveal two complementary failure modes that positive and negative steering address in opposite directions: missing genuine errors and hallucinating errors in correct solutions.
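The two failure modes can be scored directly from \boxed{k}-style predictions, where k is the earliest flagged step and −1 means no error was found. The helper below is an illustrative sketch, not the paper's evaluation code; the function name and scoring scheme are our own.

```python
# Hypothetical scorer for the two error-detection failure modes:
# a "miss" (gold has an error but the model returns -1) and a
# "false positive" (gold is clean but the model flags an error).

def failure_modes(preds, golds):
    """preds/golds are earliest-error indices, with -1 meaning no error.
    Returns (miss_rate, false_positive_rate)."""
    errs = [(p, g) for p, g in zip(preds, golds) if g >= 0]
    clean = [(p, g) for p, g in zip(preds, golds) if g == -1]
    miss = sum(p == -1 for p, _ in errs) / len(errs) if errs else 0.0
    fp = sum(p != -1 for p, _ in clean) / len(clean) if clean else 0.0
    return miss, fp

print(failure_modes([2, -1, -1, 0], [2, 3, -1, -1]))  # -> (0.5, 0.5)
```

Under this framing, positive steering should lower the miss rate without inflating the false-positive rate, while negative steering does the opposite.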
As shown in Figures 11 and 12, positive steering enhances the model's ability to detect subtle reasoning errors while avoiding false positives. In contrast, Figure 13 demonstrates that negative steering suppresses critical verification, causing the model to accept erroneous premises without scrutiny.

Problem Statement (ProcessBench - GSM8K)
Problem: Gerald works at a daycare that pays him $30 every day. He worked for an entire week and spent a total of $100. How much does he have left?
Target: $110
Question: Identify any errors in the solution step below.
Proposed Solution Step: First, calculate the weekly earnings. Gerald earns $30 each day. Assuming a standard workweek of 5 days: Weekly earnings = 5 × $30 = $150

Baseline Model (Coefficient 0.0)
The solution starts by calculating his weekly earnings. It mentions a standard workweek of 5 days. So, they multiply 5 by $30, which gives $150. That seems correct because 5 days times $30 is indeed $150. ... [Checks subtraction of expenses] ... Conclusion: The calculation steps seem accurate, and there doesn't appear to be any errors in the reasoning or arithmetic.
Outcome: The model overlooks the "entire week" wording and accepts the 5-day assumption.

Positive Steering (Coefficient +1.0)
First, the solution starts by calculating his weekly earnings... 5 by $30 gives $150. But wait, the problem says he worked for an entire week. Typically, a week has 7 days. If he worked 7 days, his earnings should be 7 × $30 = $210. The initial solution assumes a 5-day workweek, but the problem implies the full week. Conclusion: The error is in assuming 5 days instead of 7.
Outcome: Successfully identifies the reasoning error.

Figure 11. Qualitative example of positive steering on a GSM8K question from ProcessBench with the R1-32B model. The proposed solution makes an unjustified assumption about the work schedule (5 days vs. 7 days).
The baseline model accepts this assumption as "standard", while the steered model critically analyzes the specific phrasing "entire week" and identifies the discrepancy.

Problem Statement (GSM8K)
Problem: Tom bought a CD for $4, and a headphone set. In total, he paid $48. How many more CDs would Tom have been able to buy if he had decided not to buy the headphone set?
Proposed Solution: ... To find the cost of the headphone set, we subtract the cost of the CD from the total amount he paid: $48 − $4 = $44. So, the headphone set costs $44. If Tom didn't buy the headphone set, he would have $44 more to spend on CDs. Since each CD costs $4, we can divide $44 by $4 to find out how many more CDs he could buy: $44 ÷ $4 = 11.

Baseline Model (Coefficient 0.0)
Let's check paragraph 2. "If Tom didn't buy the headphone set, he would have $44 more..." Wait, the total amount he paid was $48. If he didn't buy the headphones, he would have had the full $48 available to spend on CDs. The solution claims he would have $44 more, but he should have the total $48. So 48 / 4 = 12 CDs. Since he already bought one, it's 11 more? No, the calculation of available funds is wrong. Conclusion: The solution is incorrect because the available amount should be $48.
Outcome: The model hallucinates an error in a correct solution (False Positive).

Positive Steering (Coefficient +1.0)
Checking paragraph 2... The problem states he bought a CD ($4) and a headphone set. The total paid is $48. This implies the headphone set cost $48 − $4 = $44. If he decides not to buy the headphone set, he saves exactly the cost of the headphones: $44. He can then use this $44 to buy more CDs: $44 / $4 = 11. This matches the solution. Conclusion: The logic in paragraph 2 is sound. He saves $44, not $48, because the CD purchase is fixed.
Outcome: Correctly validates the solution (Avoids Hallucination).

Figure 12.
Qualitative example on a GSM8K task with the Qwen3-4B model. The baseline model hallucinates a logic error, incorrectly arguing that the total budget ($48) should be used for the counterfactual, ignoring that the CD was still purchased. Positive steering helps the model successfully verify the correct logic: the money saved equals the cost of the unpurchased item ($44).

Problem Statement (ProcessBench-GSM8K)
Problem: A caterer prepared hot dogs for 36 guests. He prepared enough for half of the guests to have two hotdogs (implying the other half has one). However, 40 guests showed up, and everyone wanted a second hotdog.
Question: How many guests did not get a second hotdog?
Proposed Solution Step 1: The caterer planned for half of 36 guests (18) to eat two hotdogs. So, the total number of hotdogs prepared is 18 × 2 = 36 hotdogs. (Note: Ignores the single hotdogs for the other 18 guests.)

Baseline Model (Coefficient 0.0)
Let's figure out how many hotdogs he prepared. Half of 36 is 18, so 18 guests have two hotdogs (18 × 2 = 36). Wait, hold on a second. If half are having two, the other half would have one each. So the total should be 36 + 18 = 54 hotdogs. But the solution says he prepared 36 hotdogs. That doesn't seem right. The solution only accounted for the 36... their initial calculation is wrong.
Outcome: Correctly identifies the error (54 vs. 36 prepared).

Negative Steering (Coefficient -1.0)
Half of 36 is 18, so 18 guests would have two hotdogs each. Multiplying 18 by 2 gives 36 hotdogs. So, the caterer prepared 36 hotdogs in total. ... If the caterer only prepared 36 hotdogs, and 40 guests wanted a second... 4 guests couldn't get a second. The solution seems accurate... Therefore, the solution is correct, and there's no error.
Outcome: Fails to spot the error (Maintains the Error).

Figure 13. Qualitative example of negative steering on ProcessBench-GSM8K with the R1-32B model.
The prompt contains a logical error in the first step (calculating the total food prepared). The baseline model successfully debugs the setup (54 hotdogs vs. 36), whereas negative steering suppresses this critical check and accepts the erroneous premise.

E.2.2. Test-time Scaling

We next examine how steering interacts with the model's internal self-verification behavior under test-time scaling. Figures 14 and 15 illustrate that positive steering enables successful error recovery during the "Wait" phase, whereas negative steering causes the model to rationalize and reinforce injected errors rather than correct them.

Problem Statement (GSM8K-Error)
Question: Bailey starts with a certain amount of money. Then she receives a weekly allowance of $5 for 8 weeks. At the end of the 8 weeks, if she has a total of $100, how much money did Bailey start with?
Target: 60
Injected Error: First, I need to find the total amount she received from allowance: $5 each week for 8 weeks. So I'll multiply 5 × 8 = 30 and subtract it from the final total ($100).

Baseline Model (Coefficient 0.0)
Wait, that doesn't make sense... But just to double-check, let me add $70 and $30 again... Yes, 70 + 30 = 100. So, my answer is correct.
Outcome: The model fails to spot the arithmetic error (5 × 8 = 30).

Positive Steering (Coefficient +1.0)
Wait, but let me double-check. If she started with $70 and received $5 each week for 8 weeks... Wait, 5 × 8 is actually 40, not 30. So the initial money should be 100 − 40 = 60. Yes, that makes sense. If she started with $60 and received $5 for 8 weeks, that's 60 + 40 = 100. Conclusion: The initial calculation of the allowance was wrong.
Outcome: Identifies & Fixes Error (30 → 40).

Figure 14. Qualitative example of positive steering on GSM8K-Error with the R1-32B model.
The baseline model adopts the injected multiplication error (5 × 8 = 30) and confirms it during self-verification (70 + 30 = 100). Positive steering initially makes the same error but successfully catches it during the "Wait" phase, correcting the allowance to 40 and the final answer to 60.

Problem Statement (GSM8K-Error)
Question: Juan and Carlos sell lemonade. Costs per gallon: $3 for lemons, $2 for sugar. They sell 20 glasses per gallon at $0.50 each. Profit is $25. How much was spent on lemons?
Injected Error: The costs are $3 for lemons and $2 for sugar per gallon, so the total cost per gallon is $6. Revenue per gallon is $10. Profit per gallon is $4. If the total profit is $25, the number of gallons sold is 6.25. The total spent on lemons is 6.25 × $3 = $18.75.

Baseline Model (Coefficient 0.0)
Wait, but the problem asks for the amount spent on lemons... So, 6.25 gallons × $3 = $18.75. Wait, but let me double-check my steps to make sure I didn't make a mistake. Total cost per gallon: $3 (lemons) + $2 (sugar) = $5? Wait, no, $3 + $2 is $5, so the total cost per gallon is $5, not $6. ... Therefore, the amount spent on lemons is 5 gallons × $3 per gallon = $15.
Outcome: The model catches the addition error ($3 + $2 = $5) and corrects the answer.

Negative Steering (Coefficient -1.0)
Wait, but the problem asks for the amount spent on lemons... So, using the number of gallons sold, which is 6.25, multiplied by the cost per gallon for lemons, which is $3, gives the total amount spent on lemons. So, 6.25 gallons × $3 per gallon = $18.75. But wait, the problem might expect an exact answer... 25/4 × 3 = 75/4, which is $18.75. Yes, that makes sense.
Outcome: Maintains the Error.

Figure 15. Qualitative example of negative steering on GSM8K-Error with the R1-32B model. The prompt context introduces an arithmetic error ($3 + $2 = $6 instead of $5). The baseline model initially adopts the error but triggers a self-correction after generating "Wait," re-evaluating the costs.
The negatively steered model accepts the erroneous premise and justifies the incorrect result.