Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Notice: This research summary and analysis were generated automatically using AI. For accuracy, please refer to the original arXiv source.

Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near-randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning framework for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 × 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE .


💡 Research Summary

The paper identifies a critical weakness in current large Vision‑Language Models (VLMs): despite impressive performance on many multimodal tasks, they struggle with basic visual perception and reasoning, as evidenced by near‑random accuracy on even the simplest 2 × 2 jigsaw puzzles. The authors argue that this deficiency stems from the scarcity of high‑quality multimodal reinforcement‑learning (RL) data and the inability of conventional pre‑training/fine‑tuning pipelines to teach structured, step‑by‑step problem solving.

To address this, they introduce AGILE (Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs). AGILE reframes jigsaw solving as an interactive process where a VLM repeatedly generates executable Python code to manipulate a simulated environment. At each iteration the model can (1) swap any two puzzle pieces, (2) observe the current layout, (3) crop a region for closer inspection, or (4) zoom into a region for fine‑grained analysis. The environment executes the code, returns a visual observation, and the model uses this feedback to decide the next action. This loop creates a closed‑feedback system that forces the model to develop both low‑level visual discrimination (identifying which piece belongs where) and higher‑level logical planning (deciding an efficient sequence of swaps).
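The observe-act loop described above can be sketched in miniature. The names below (`JigsawEnv`, `greedy_solver`) are illustrative assumptions rather than the paper's actual API, the rendered-image observation is replaced by a plain piece ordering, and a trivial greedy policy stands in for the VLM:

```python
# Minimal sketch of an interactive jigsaw loop, assuming a swap-based
# environment. In AGILE the policy is a VLM emitting Python code and the
# observation is an image; here both are simplified stand-ins.
import random


class JigsawEnv:
    """Toy jigsaw environment: a scrambled list of piece indices."""

    def __init__(self, n_pieces=4, seed=0):
        self.solution = list(range(n_pieces))
        rng = random.Random(seed)
        self.state = self.solution[:]
        while self.state == self.solution:
            rng.shuffle(self.state)

    def swap(self, i, j):
        # One "action": exchange two pieces, then return the new observation.
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.observe()

    def observe(self):
        # In AGILE this would be a rendered image of the current layout;
        # the piece ordering serves as a stand-in observation here.
        return list(self.state)

    def solved(self):
        return self.state == self.solution


def greedy_solver(env, max_steps=8):
    """Stand-in policy: swap each misplaced piece into its target slot."""
    for step in range(max_steps):
        if env.solved():
            return step
        # Find the first misplaced slot and fetch the piece that belongs there.
        i = next(k for k, p in enumerate(env.state) if p != k)
        j = env.state.index(i)
        env.swap(i, j)
    return max_steps


env = JigsawEnv(n_pieces=4, seed=1)
steps = greedy_solver(env)
print(env.solved(), steps)
```

Each swap places at least one piece correctly, so a 2 × 2 puzzle (4 pieces) is always solvable within the three swaps the paper's step reward targets.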

Data generation proceeds in two stages. First, a “cold‑start” dataset of 1.6 K high‑quality expert trajectories is collected using Gemini 2.5 Pro, which interacts with the environment under carefully crafted prompts. The trajectories are filtered for correctness and balanced across action types and step counts (4–8 steps). This stage equips the target model (e.g., Qwen‑2.5‑VL‑7B) with basic instruction following and code‑generation abilities.
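The correctness filtering and step-count balancing might look like the following sketch; the trajectory field names (`"solved"`, `"steps"`) and the per-bucket cap are assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of cold-start trajectory curation: keep only correct
# rollouts within the 4-8 step range, then cap each step-count bucket so
# no trajectory length dominates the dataset.
from collections import defaultdict


def filter_and_balance(trajectories, min_steps=4, max_steps=8, per_bucket=200):
    buckets = defaultdict(list)
    for traj in trajectories:
        # Discard failed rollouts and those outside the target length range.
        if not traj["solved"]:
            continue
        if not (min_steps <= traj["steps"] <= max_steps):
            continue
        buckets[traj["steps"]].append(traj)
    kept = []
    for steps in sorted(buckets):
        kept.extend(buckets[steps][:per_bucket])
    return kept


raw = [
    {"solved": True, "steps": 5},
    {"solved": False, "steps": 5},   # failed rollout: dropped
    {"solved": True, "steps": 12},   # too long: dropped
    {"solved": True, "steps": 4},
]
print(len(filter_and_balance(raw)))
```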

Second, the model is fine‑tuned with reinforcement learning using Group Relative Policy Optimization (GRPO). GRPO samples a group of trajectories for each input, computes the average reward of the group as a baseline, and updates the policy by maximizing a clipped surrogate objective. The reward function combines three components: (a) an accuracy reward (1 if the final puzzle is solved, 0 otherwise), (b) a format reward (1 if the model's output follows the required structured output tags), and (c) a step reward that encourages solving the puzzle in the minimal number of swaps (for 2 × 2, at most three swaps). The step reward is only applied when the puzzle is correctly solved, preventing the model from gaming the metric early in training.
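A minimal sketch of this reward composition and the group-mean baseline follows. The equal component weights and the linear step-reward shape are assumptions; the paper's precise formulation may differ:

```python
# Hedged sketch: composite reward (accuracy + format + conditional step
# bonus) and group-relative advantages as in GRPO's mean-baseline step.
def trajectory_reward(solved, format_ok, n_swaps, max_swaps=3):
    acc = 1.0 if solved else 0.0
    fmt = 1.0 if format_ok else 0.0
    # Step bonus only on success, so unsolved-but-short rollouts
    # cannot game the metric early in training.
    step = max(0.0, (max_swaps - n_swaps) / max_swaps) if solved else 0.0
    return acc + fmt + step


def grpo_advantages(rewards):
    """Group-relative advantages: each reward minus the group-mean baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# A group of 4 sampled trajectories for the same puzzle input:
group = [
    trajectory_reward(solved=True, format_ok=True, n_swaps=2),
    trajectory_reward(solved=True, format_ok=True, n_swaps=3),
    trajectory_reward(solved=False, format_ok=True, n_swaps=3),
    trajectory_reward(solved=False, format_ok=False, n_swaps=1),
]
print([round(a, 3) for a in grpo_advantages(group)])
```

Because the baseline is the group mean, the advantages always sum to zero: trajectories that beat their siblings are reinforced, the rest are suppressed, without needing a learned value function.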

Experimental results are striking. On the 2 × 2 jigsaw, AGILE raises accuracy from 9.5 % to 82.8 %; similar gains are observed for larger grids (3 × 3, 4 × 4). Beyond the proxy task, the AGILE‑pre‑trained model exhibits improved performance on nine downstream vision benchmarks covering high‑resolution image understanding, real‑world scene analysis, fine‑grained classification, visual reasoning, and hallucination detection, achieving an average boost of 3.1 %. Scaling the amount of generated jigsaw data yields near‑linear performance improvements, and under equal data budgets, jigsaw‑based pre‑training matches or exceeds the results of conventional QA‑style pre‑training.

The paper’s contributions are threefold: (1) the AGILE framework that casts a visual puzzle as a stepwise, code‑driven interaction, thereby fostering incremental perception and reasoning improvements; (2) a scalable, controllable jigsaw data synthesis pipeline that produces high‑quality multimodal RL trajectories without human annotation; (3) a GRPO‑based reinforcement‑learning scheme with a composite reward that efficiently optimizes the policy in a multi‑sample setting.

In summary, AGILE demonstrates that embedding VLMs in an interactive, feedback‑rich environment can substantially close the gap in foundational visual perception and reasoning. The approach opens a promising direction for future work, such as extending to more complex visual puzzles (e.g., 3‑D or dynamic scenes), integrating external tools (search engines, drawing APIs), and exploring richer reward structures to further enhance multimodal reasoning capabilities.

