CoLT: Reasoning with Chain of Latent Tool Calls

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as "tool calls". Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain the information of a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks them back into a full reasoning step. In this way, the main model keeps reasoning in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.


💡 Research Summary

The paper introduces CoLT (Chain of Latent Tool calls), a framework that enables large language models (LLMs) to perform chain-of-thought (CoT) reasoning more efficiently by offloading compressed reasoning steps to external, lightweight decoders. Traditional explicit CoT generates each reasoning token sequentially, leading to high inference cost. Implicit latent CoT methods reduce token length but require substantial model architecture changes and extensive retraining, limiting their applicability.

CoLT addresses these issues by having the main LLM generate special "seed tokens" that embed condensed information about a reasoning step in their hidden states. Two token types are defined: body tokens, which carry the latent embeddings, and trigger tokens, which indicate which decoder should be invoked. When a latent tool call is triggered, the final-layer hidden states of the seed tokens (H) are extracted, linearly projected (P_D) into the input space of a chosen decoder D, and the decoder autoregressively generates the explicit text tokens R that correspond to the original reasoning step. The generated text is concatenated to the ongoing context, and the main LLM continues reasoning. Because the entire pipeline is differentiable, gradients flow from the decoder back to the main model, allowing joint optimization.
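The H → P_D → D pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the random projection, and the stub decoder are all placeholder assumptions standing in for learned components.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): K seed tokens,
# main-model hidden size D_MAIN, decoder input size D_DEC.
rng = np.random.default_rng(0)
K, D_MAIN, D_DEC = 2, 16, 8

# H: final-layer hidden states of the K seed tokens from the main model.
H = rng.normal(size=(K, D_MAIN))

# P_D: the (learned) linear projection into decoder D's input space.
P_D = rng.normal(size=(D_MAIN, D_DEC))

def latent_tool_call(H, P_D, decoder):
    """Project seed-token hidden states into the decoder's input space,
    then let the external decoder unpack them into explicit tokens R."""
    Z = H @ P_D          # (K, D_DEC): conditioning inputs for the decoder
    return decoder(Z)    # autoregressive text generation (stubbed below)

# Stub decoder: a real decoder would generate the reasoning-step text
# autoregressively; here we return placeholder token ids for illustration.
def stub_decoder(Z):
    return [int(abs(row.sum() * 100)) % 1000 for row in Z]

R = latent_tool_call(H, P_D, stub_decoder)
print(R)  # tokens that would be appended back to the main model's context
```

In the actual framework both P_D and the decoder are trained jointly with the main model, since gradients flow through this whole path.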

Training consists of two supervised losses: L_main, encouraging the main model to emit correct seed tokens, and L_lat, a cross‑entropy loss on the decoder’s output. The total supervised loss is L_sup = L_main + L_lat. To go beyond gold CoT supervision, the authors also apply reinforcement learning using Group Relative Policy Optimization (GRPO). By sampling both main‑model outputs and decoder outputs, multiple reasoning trajectories are generated for each question; a reward based on answer correctness (1 for correct, 0.1 for correct format, 0 otherwise) guides policy updates, with a KL‑penalty for stability.
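The reward scheme and the group-relative normalization at the heart of GRPO can be sketched as follows. The reward values (1 / 0.1 / 0) follow the paper's description; the advantage computation is the standard GRPO normalization over a sampled group, written here as a simplified illustration.

```python
def reward(answer_correct: bool, format_correct: bool) -> float:
    """Reward described in the paper: 1 for a correct answer,
    0.1 for output in the correct format, 0 otherwise."""
    if answer_correct:
        return 1.0
    if format_correct:
        return 0.1
    return 0.0

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each trajectory's reward
    against the mean and std of its sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

# Four sampled trajectories for one question: one fully correct,
# one well-formatted but wrong, two incorrect.
rs = [reward(True, True), reward(False, True),
      reward(False, False), reward(False, False)]
advs = group_advantages(rs)
```

The normalized advantages sum to (approximately) zero within each group, so correct trajectories are pushed up relative to their group rather than against an absolute baseline; the KL penalty mentioned above would be added to this objective in the full algorithm.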

Experiments are conducted on four math reasoning benchmarks: GSM8K-Aug (an expanded version of GSM8K with ~385k training examples), GSM8K-Hard, SVAMP, and MultiArith. CoLT is evaluated with one-seed and two-seed configurations and compared against strong baselines such as Coconut, CODI, COLAR (with various compression ratios), SIM-CoT, and standard CoT. Results show that CoLT achieves higher accuracy while reducing reasoning chain length. For example, on GSM8K-Aug, CoLT (2-seed) reaches 45.5% accuracy with a 10.84-token reduction, outperforming COLAR (5×), which attains 42.2% accuracy with a 13.2-token reduction. Reinforcement learning further improves performance on the harder out-of-domain datasets.

The authors also explore alternative decoder architectures, including multi‑hot decoders, demonstrating that the framework is flexible and not tied to a specific decoder design. Ablation studies confirm that the number of seed tokens and the choice of decoder affect the trade‑off between compression and accuracy, but reasonable defaults work well across datasets.

Limitations are acknowledged: the optimal seed‑token length and decoder selection require hyper‑parameter tuning; the current evaluation focuses on mathematical problem solving, so generalization to other domains (e.g., code generation, commonsense reasoning) remains to be validated; and maintaining separate decoder modules incurs additional memory overhead, though this is offset by the reduced token generation cost.

In summary, CoLT proposes a novel “latent tool call” paradigm that preserves the explicit‑text reasoning capabilities of pretrained LLMs while delegating compressed reasoning steps to smaller, efficient decoders. This approach yields both computational savings and accuracy gains, offering a practical pathway to deploy powerful LLMs in resource‑constrained settings without extensive model redesign or retraining.

