Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning
Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.
💡 Research Summary
The paper investigates how the structural complexity of code used for fine‑tuning influences the reasoning abilities of large language models (LLMs). While prior work has shown that exposure to code can improve multi‑step reasoning, it has treated code as a monolithic training signal. This study instead quantifies two concrete complexity metrics—cyclomatic complexity (CC), which measures the number of independent execution paths, and logical lines of code (LLoc), which counts executable statements independent of comments or formatting. By controlling these metrics, the authors construct two complementary datasets.
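To make the two metrics concrete, here is a rough stdlib-only sketch of how they can be approximated: cyclomatic complexity as one plus the number of branching constructs, and LLoc as the count of executable statements. (Production tools such as radon implement more refined rules; this is an illustration, not the paper's measurement code.)

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Rough CC: 1 + number of branching constructs in the AST."""
    tree = ast.parse(source)
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.With, ast.Assert, ast.comprehension)
    count = 1
    for node in ast.walk(tree):
        if isinstance(node, branches):
            count += 1
        elif isinstance(node, ast.BoolOp):
            count += len(node.values) - 1  # each and/or adds a path
    return count

def logical_lines(source: str) -> int:
    """Approximate LLoc: executable statements, ignoring comments and blanks."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:
        return "neg"
    for i in range(x):
        if i % 2 == 0 and i > 2:
            x += 1
    return x
"""
print(cyclomatic_complexity(snippet), logical_lines(snippet))  # → 5 7
```

Note that formatting changes (comments, blank lines, line wrapping) leave both numbers untouched, which is exactly why the authors prefer LLoc over raw line counts.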
The “solution‑driven” dataset is derived from CodeNet, a large repository of competitive programming problems with many independent solutions per problem. For each problem‑language pair, the authors compute CC and LLoc for all available solutions, then select five representative solutions spanning the spectrum from minimal to maximal complexity (MIN, LOW, MID, HIGH, MAX). A control set (CTRL) samples uniformly across all complexity levels. This yields twelve splits (six per metric), each containing 8,087 samples.
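The selection step above can be sketched as picking solutions at fixed positions in the complexity ranking. The function below is a hypothetical re-creation of that idea, not the authors' released code:

```python
def pick_representatives(solutions, metric):
    """Pick MIN/LOW/MID/HIGH/MAX solutions by a complexity metric,
    taking the items at 0%, 25%, 50%, 75%, and 100% of the ranking."""
    ranked = sorted(solutions, key=metric)
    n = len(ranked)
    idx = [round(q * (n - 1) / 4) for q in range(5)]
    labels = ["MIN", "LOW", "MID", "HIGH", "MAX"]
    return dict(zip(labels, (ranked[i] for i in idx)))

# toy example: "solutions" are just their CC values here
reps = pick_representatives([3, 12, 7, 1, 25, 9, 5, 18, 2], metric=lambda c: c)
print(reps)  # → {'MIN': 1, 'LOW': 3, 'MID': 7, 'HIGH': 12, 'MAX': 25}
```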
The “problem‑driven” dataset aggregates three high‑quality instruction‑response code corpora (Magicoder, Evol‑Instruct, WizardLM). These datasets pair a natural‑language instruction with a single reference solution, so the difficulty of the underlying task naturally correlates with code complexity. After extracting Python, JavaScript, and Java snippets and computing the same two metrics, the authors again partition the data into five complexity bins per language, balancing the sample size to match the solution‑driven splits and adding a CTRL set.
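A minimal sketch of the problem-driven partitioning, assuming per-sample complexity scores are already computed; the bin names, the uniform CTRL set, and the size balancing follow the summary above, while the exact procedure in the paper may differ:

```python
import random

def quintile_bins(samples, metric, seed=0):
    """Partition samples into five equal-size complexity bins plus a
    uniformly sampled CTRL set, balanced to a common size."""
    ranked = sorted(samples, key=metric)
    n = len(ranked)
    names = ["MIN", "LOW", "MID", "HIGH", "MAX"]
    bins = {name: ranked[i * n // 5:(i + 1) * n // 5]
            for i, name in enumerate(names)}
    size = min(len(b) for b in bins.values())
    rng = random.Random(seed)
    bins = {k: rng.sample(v, size) for k, v in bins.items()}
    bins["CTRL"] = rng.sample(list(samples), size)  # uniform over all levels
    return bins

# toy example with integer "complexities" 0..99
splits = quintile_bins(list(range(100)), metric=lambda x: x)
print(sorted(splits["MIN"]))  # the lowest-complexity fifth
```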
To isolate the effect of code exposure from fine‑tuning per se, a natural‑language baseline is created by sampling an equal‑size subset from the ShareGPT corpus used in earlier code‑reasoning studies.
Six open‑weight LLMs (Llama‑2‑7B, Llama‑2‑13B, Mistral‑7B, Falcon‑7B, and two variants) are fine‑tuned for a single epoch on each split. Downstream reasoning performance is evaluated on six widely used benchmarks, including arithmetic (GSM‑8K), mathematical problem solving (MATH), multi‑disciplinary reasoning (BIG‑BENCH), commonsense (ARC‑AGI), and code‑related reasoning (HumanEval).
Key findings:
- Code fine‑tuning does improve reasoning, but gains are highly variable across model sizes, architectures, and benchmarks.
- The relationship between code complexity and reasoning performance is non‑monotonic. Accuracy typically peaks at intermediate complexity (CC ≈ 5‑10, LLoc ≈ 15‑30) and declines for both very simple and very complex code.
- In 83% of experiments, restricting fine‑tuning to a narrow complexity band outperforms training on a heterogeneous mix of complexities. The CTRL sets are rarely optimal.
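The "sweet spot" finding suggests a simple complexity-band filter over training data. The predicate below is illustrative: the field names and the thresholds (CC 5-10, LLoc 15-30, taken from the ranges reported above) are assumptions, and the paper notes the optimal band is model-specific:

```python
def in_sweet_spot(sample, cc_range=(5, 10), lloc_range=(15, 30)):
    """Keep a training sample only if both complexity metrics fall in
    the intermediate band where accuracy typically peaked."""
    cc_lo, cc_hi = cc_range
    ll_lo, ll_hi = lloc_range
    return cc_lo <= sample["cc"] <= cc_hi and ll_lo <= sample["lloc"] <= ll_hi

data = [{"cc": 3, "lloc": 40}, {"cc": 7, "lloc": 22}, {"cc": 14, "lloc": 18}]
filtered = [s for s in data if in_sweet_spot(s)]
print(filtered)  # → [{'cc': 7, 'lloc': 22}]
```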
These results suggest that the usefulness of code as a training signal is governed more by its structural properties than by sheer quantity or diversity. Consequently, a data‑centric approach—selecting or generating code that matches a model‑specific “sweet spot” of complexity—can yield larger reasoning improvements than indiscriminate scaling of code data. The authors release all complexity‑controlled datasets to facilitate reproducibility and further research into the interaction between code structure and LLM reasoning. Future directions include automated complexity‑aware data filtering pipelines, complexity‑conditioned data augmentation, and extending the analysis to other programming languages and domains.