Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning


Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capability, yet it faces practical limits due to the linear growth of the KV cache and the quadratic cost of attention. In this paper, we introduce Accordion-Thinking, an end-to-end framework in which LLMs learn to self-regulate the granularity of their reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards earlier thoughts to reduce dependence on historical tokens. We apply reinforcement learning to further incentivize this capability, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinker demonstrates that, with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependence on historical tokens and without compromising solution quality: it achieves roughly 3× higher throughput at matched accuracy under a 48 GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.


💡 Research Summary

The paper introduces “Accordion‑Thinking,” an end‑to‑end framework that equips large language models (LLMs) with the ability to self‑regulate the granularity of their reasoning by dynamically summarizing intermediate steps. Traditional Chain‑of‑Thought (CoT) reasoning generates long, unstructured token sequences; as the sequence grows, the key‑value (KV) cache expands linearly and the attention computation becomes quadratic, quickly exhausting GPU memory and limiting inference speed. Accordion‑Thinking tackles this bottleneck by interleaving detailed reasoning segments (dₖ) with concise step summaries (sₖ). After a summary sₖ₋₁ is produced, the corresponding detailed segment dₖ₋₁ is removed from the KV cache, and subsequent reasoning only attends to the input and the accumulated summaries. This “Fold” mode reduces the effective context size from O(∑|dᵢ| + ∑|sᵢ|) to O(∑|sᵢ|), dramatically cutting memory usage and attention cost while preserving logical continuity.
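The Fold-versus-Unfold context behavior described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names (`ReasoningStep`, `fold_context`, `unfold_context`) and the plain-string "context" are assumptions standing in for the model's actual KV cache eviction.

```python
# Illustrative sketch of Fold-mode context management: once a step's summary
# s_k is emitted, the detailed segment d_k is dropped, so later reasoning
# attends only to the input and the accumulated summaries.
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    detail: str   # detailed reasoning segment d_k (long)
    summary: str  # concise step summary s_k (short)


def unfold_context(prompt: str, steps: list[ReasoningStep]) -> str:
    """Unfold mode: full context of size O(sum|d_i| + sum|s_i|)."""
    parts = [prompt]
    for step in steps:
        parts += [step.detail, step.summary]
    return "\n".join(parts)


def fold_context(prompt: str, steps: list[ReasoningStep]) -> str:
    """Fold mode: details are evicted, leaving context of size O(sum|s_i|)."""
    return "\n".join([prompt] + [step.summary for step in steps])


steps = [
    ReasoningStep(detail="...long derivation of the recurrence..." * 20,
                  summary="Step 1: derived a(n) = 2*a(n-1) + 1."),
    ReasoningStep(detail="...long case analysis..." * 20,
                  summary="Step 2: closed form is a(n) = 2^n - 1."),
]
folded = fold_context("Solve the recurrence.", steps)
full = unfold_context("Solve the recurrence.", steps)
print(len(folded), len(full))  # folded context is far shorter than full
```

The key invariant is that `folded` must still carry enough state (here, the derived recurrence and closed form) for the next step to proceed correctly; the RL stage described below is what pressures the model to make its summaries self-contained in this sense.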

To teach the model this behavior, the authors build a synthetic “Accordion” dataset. They start from 10k long‑form CoT examples (openr1‑math‑46k), rewrite each trace with a teacher LLM (DeepSeek‑V3.2) into a structured format that explicitly marks each reasoning step and its summary using tags, and filter out low‑quality rewrites based on structural integrity, step count (2–6 steps), detailed segment length (≤ 6,144 tokens), and summary length (≥ 100 tokens). The final training corpus contains roughly 14k Fold‑mode examples.
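The filtering criteria above can be expressed as a simple predicate. This is a hedged sketch: the tag names (`<detail>`, `<summary>`) and the whitespace token counter are placeholders for the paper's actual markup and tokenizer; only the thresholds (2–6 steps, details ≤ 6,144 tokens, summaries ≥ 100 tokens) come from the text.

```python
# Hypothetical quality filter mirroring the stated criteria; tag names and
# the crude whitespace tokenizer are assumptions, not the paper's code.
import re


def token_count(text: str) -> int:
    # Whitespace split as a stand-in for the real tokenizer.
    return len(text.split())


def passes_filter(trace: str) -> bool:
    details = re.findall(r"<detail>(.*?)</detail>", trace, re.DOTALL)
    summaries = re.findall(r"<summary>(.*?)</summary>", trace, re.DOTALL)
    # Structural integrity: every detailed segment pairs with a summary.
    if len(details) != len(summaries):
        return False
    # Step count must fall within 2-6.
    if not (2 <= len(details) <= 6):
        return False
    # Detailed segments capped at 6,144 tokens.
    if any(token_count(d) > 6144 for d in details):
        return False
    # Summaries must carry at least 100 tokens of content.
    if any(token_count(s) < 100 for s in summaries):
        return False
    return True
```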

Supervised fine‑tuning (SFT) aligns the model with the required markup but does not guarantee that summaries retain all necessary information. Therefore, the authors apply reinforcement learning (RL) to incentivize high‑quality, self‑contained summaries. They adopt Group Relative Policy Optimization (GRPO) with a clipped objective (no KL penalty) and define a binary trajectory‑level reward for final answer correctness. The advantage estimate is centered by subtracting the mean reward of a group of samples generated for the same query, encouraging the model to produce summaries that enable correct downstream reasoning even when earlier detailed steps are discarded.
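The group-centered advantage described above reduces to a one-line computation. This sketch shows only that step of GRPO (the clipped policy objective around it is omitted); the binary rewards and mean-subtraction follow the summary's description.

```python
# Group-relative advantage for GRPO: G rollouts for the same query receive a
# binary correctness reward, and each rollout's advantage is its reward minus
# the group mean. (The paper's objective is PPO-style clipped, no KL penalty.)
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Example: 4 rollouts for one query, 3 correct and 1 incorrect.
adv = group_relative_advantages([1.0, 1.0, 0.0, 1.0])
print(adv)  # [0.25, 0.25, -0.75, 0.25]
```

Because advantages are centered within each group, rollouts that answer correctly despite discarded details are reinforced relative to their peers, which is exactly the pressure that makes summaries self-contained.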

Experiments compare three regimes across mathematics, programming, and logical puzzles: (1) standard Unfold mode (full context), (2) Fold mode with learned summarization, and (3) a mixed regime. Initially, Fold mode lags behind Unfold in accuracy, but as RL training proceeds the performance gap narrows and eventually vanishes—a phenomenon the authors call “Gap‑Vanishing.” This indicates that the model learns to compress essential reasoning state into the summaries effectively.

In quantitative benchmarks, Fold mode achieves roughly three times the token-per-second throughput of Unfold mode on a 48 GB GPU configuration while matching its accuracy. Human evaluations confirm that the generated step summaries are coherent, semantically faithful, and often sufficient to infer the final answer without consulting the discarded detailed steps, thereby improving interpretability.

The paper’s contributions are threefold: (i) a data synthesis pipeline that teaches LLMs to produce and rely on step‑wise summaries, (ii) an RL‑based self‑compression objective that aligns reasoning and summarization skills, and (iii) empirical evidence that self‑compression can close the efficiency‑accuracy gap, delivering both speed and transparency.

Future directions include adaptive summary length control, hierarchical multi‑level summarization, extending the approach to non‑mathematical domains (e.g., legal or medical reasoning), and developing automatic metrics for summary fidelity. Accordion‑Thinking thus establishes a promising paradigm where LLMs not only think but also learn to think compactly, enabling scalable, readable, and resource‑efficient reasoning.

