Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Reading time: 5 minutes
...

📝 Original Info

  • Title: Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
  • ArXiv ID: 2512.22255
  • Date: 2025-12-24
  • Authors: Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville

📝 Abstract

We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning. Second, these `incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces -- shifting their distribution closer to the model's own distribution -- and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains like math, algorithmic reasoning and code generation using MATH, GSM8K, Countdown and MBPP datasets on various language models ranging from 1.5B to 9B across Qwen, Llama, and Gemma models. Our study shows that curating datasets that are closer to the model's distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.

💡 Deep Analysis

Figure 1

📄 Full Content

Large language models (LLMs) have made rapid progress on reasoning tasks, often using supervised fine-tuning on chain-of-thought (CoT) traces (Guo et al., 2025a; Bercovich et al., 2025; Ye et al., 2025). A common assumption in this line of research is that correctness is the primary determinant of data quality: the more correct a dataset is, the better it should be for training (Ye et al., 2025; Muennighoff et al., 2025). This assumption has guided the construction of widely used reasoning datasets, which typically rely on heavy human annotation (Hendrycks et al., 2021a; Cobbe et al., 2021) or on filtering of model outputs with rule-based verifiers that validate the CoT traces via final-answer checking (Guo et al., 2025b; Zelikman et al., 2022).

In this work we consider three broad categories of reasoning datasets that could be used to fine-tune a model: (1) Human-annotated, human-written traces: fully correct and carefully verified. These are treated as the gold standard, but they may be far from the model’s distribution. (2) Synthetic traces generated by more capable models, typically from the same family, that lead to correct answers, often filtered with rule-based verifiers that check only the final solution. These traces are closer to the distribution of the model being fine-tuned, but their reasoning steps may still be partially flawed. (3) Synthetic traces generated by more capable models, typically from the same family, that lead to incorrect answers. These are generally discarded in existing pipelines, yet they often contain valid reasoning steps from which a model can learn. Categories (1) and (2) are consistently favored in prior work (Ye et al., 2025; Guo et al., 2025b; Zelikman et al., 2022): human-annotated data is trusted for correctness, while model-generated correct-answer traces provide scale. Category (3), however, is largely ignored under the assumption that incorrect answers imply poor reasoning. Works such as Setlur et al. (2024) and Aygün et al. (2021) propose using incorrect traces to train better models in a contrastive setting or to train better verifiers; however, whether these traces can directly improve math reasoning has not been thoroughly tested.
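The final-answer filtering that separates Categories (2) and (3) can be sketched in a few lines; the trace format (an "Answer:" suffix) and the function names below are assumptions for illustration, not the paper's pipeline.

```python
# Hypothetical sketch (not the authors' code): partition model-generated CoT
# traces into Category (2) vs. Category (3) with a rule-based final-answer
# check. The trace format and helper names are assumptions for illustration.

def extract_final_answer(trace: str) -> str:
    """Pull the final answer out of a CoT trace, e.g. the text after 'Answer:'."""
    return trace.rsplit("Answer:", 1)[-1].strip()

def partition_traces(traces: dict[str, list[str]], gold_answers: dict[str, str]):
    """Split problem_id -> candidate traces by final-answer correctness."""
    correct, incorrect = [], []  # Category (2) and Category (3)
    for pid, candidates in traces.items():
        for trace in candidates:
            if extract_final_answer(trace) == gold_answers[pid]:
                correct.append((pid, trace))
            else:
                incorrect.append((pid, trace))  # usually discarded: Category (3)
    return correct, incorrect
```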

We therefore pose two questions: (1) Can model-generated CoT traces that lead to incorrect final answers still directly help models learn to reason better, and if so, why? (2) Should we prioritize fully correct human-written traces that may lie further from the model’s output distribution, or model-generated traces that are closer to this distribution, even if they are imperfect?

To investigate these questions, we conduct a systematic study of supervised fine-tuning (SFT) on reasoning traces across Categories (1), (2), and (3) above. Our experiments cover multiple reasoning benchmarks (MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), Countdown (Pan et al., 2025), and MBPP (Austin et al., 2021)) and three model families ranging from 1.5B to 9B parameters (Gemma (Team et al., 2024), Llama (Grattafiori et al., 2024), Qwen (Qwen et al., 2025)). We generate reasoning traces for Categories (2) and (3) using the same or more capable language models (LMs). Surprisingly, we find that training with Category (3) data can improve reasoning performance, even more than training on human-written correct traces. We also show that paraphrasing human solutions with an LLM improves performance by bringing them closer to the model’s distribution. Finally, we progressively introduce completely flawed CoTs into our datasets to reveal how tolerant models are to errors before performance starts to diminish.
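The paraphrasing experiment can be pictured as asking an LLM to restate each human-written solution while preserving its steps and final answer; the prompt wording and the `generate` callable below are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of LLM paraphrasing of human-annotated traces.
# `generate` stands in for whatever text-generation call is available;
# its signature and the prompt wording are assumptions.
from typing import Callable

PARAPHRASE_PROMPT = (
    "Rewrite the following solution in your own words. Keep every reasoning "
    "step and the final answer unchanged.\n\nSolution:\n{solution}"
)

def paraphrase_dataset(solutions: list[str],
                       generate: Callable[[str], str]) -> list[str]:
    """Map each human-written solution to an LLM paraphrase of it."""
    return [generate(PARAPHRASE_PROMPT.format(solution=s)) for s in solutions]
```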

While recent approaches prioritize correctness, our results highlight an underexplored dimension: closeness to the model’s distribution can matter as much as, or even more than, correctness.

We summarize the main contributions of our work as follows:

• We show that model-generated CoT traces leading to incorrect final answers (Category 3), which are typically discarded, can improve reasoning performance when used for supervised fine-tuning.

• We demonstrate that training data closer to the model’s output distribution, even if imperfect, can be more effective than correct human-written traces that lie further from the model’s distribution.

• We progressively degrade CoTs to quantify how much incorrect reasoning a model tolerates before performance degrades, providing insight into the robustness of learning from imperfect data (see the sketch after this list).

• We paraphrase human data to better match the model’s distribution, improving reasoning scores.

• Qualitatively, we analyse CoT traces generated by the models and show that final-answer checking is not the most reliable or holistic way to evaluate CoTs.
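One way to picture the progressive-degradation contribution above is to corrupt a growing fraction of steps in each trace and fine-tune on each corrupted dataset; the step delimiter and the corruption rule (perturbing numbers) below are assumptions chosen for illustration.

```python
# Hypothetical sketch of progressively degrading CoT traces: corrupt a growing
# fraction of reasoning steps, yielding one training set per flaw rate.
import random
import re

random.seed(0)  # keep the illustration reproducible

def corrupt_step(step: str) -> str:
    """Perturb every number in a step so the step becomes locally wrong."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + random.randint(1, 9)), step)

def degrade_trace(trace: str, flaw_rate: float) -> str:
    """Corrupt roughly `flaw_rate` of the newline-separated steps of a trace."""
    steps = trace.split("\n")
    n_flawed = round(flaw_rate * len(steps))
    for i in random.sample(range(len(steps)), n_flawed):
        steps[i] = corrupt_step(steps[i])
    return "\n".join(steps)

# Toy example: sweep flaw rates; each rate defines one dataset for a separate SFT run.
cot_traces = ["Step 1: 12 + 7 = 19\nStep 2: 19 * 2 = 38\nAnswer: 38"]
degraded = {rate: [degrade_trace(t, rate) for t in cot_traces]
            for rate in (0.0, 0.25, 0.5, 0.75, 1.0)}
```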

In this section, we provide background on improving reasoning in LLMs. We begin by formalizing LLMs and supervised fine-tuning (SFT) as the learning paradigm. We then discuss how large neural networks can learn from noisy data and, finally, draw a parallel to regularization.
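In standard form, the SFT objective referred to here is token-level cross-entropy over the trace-and-answer sequence given the problem; the notation below is a conventional rendering rather than the paper's exact formalization.

```latex
% Standard SFT objective: maximize the likelihood of the CoT trace and
% final answer y given the problem x under the model p_\theta over dataset D.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = - \mathbb{E}_{(x,\, y) \sim \mathcal{D}}
      \left[ \sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```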

📸 Image Gallery

Accuracy and training-loss plots for Gemma-2B, Llama-8B, and Qwen-1.5B on the MATH500, GSM8K, Countdown, and MBPP test sets, comparing human-written, model-generated, and paraphrased training data, plus error-introduction results.

Reference

This content is AI-processed based on open access ArXiv data.
