Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Reading time: 6 minutes

📝 Original Info

  • Title: Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
  • ArXiv ID: 2512.24873
  • Date: 2025-12-31
  • Authors: Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Xin Lin, Chonghuan Liu, ZhenDong Liu, Zhiqiang Lv, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng

📝 Abstract

Agentic crafting, unlike one-shot response generation for simple tasks, requires LLMs to operate in real-world environments over multiple turns: taking actions, observing outcomes, and iteratively refining artifacts until complex requirements are satisfied. Yet the spirit of agentic crafting reaches beyond code, into broader tool- and language-mediated workflows where models must plan, execute, and remain reliable under interaction. Reaching this new regime demands sustained, painstaking effort to build an agentic ecosystem as the foundational bedrock, ultimately culminating in an agent model as the capstone. ROME wasn't built in a day. A principled, end-to-end agentic ecosystem can streamline the development of agent LLMs from training to production deployment, accelerating the broader transition into the agent era. However, the open-source community still lacks such an ecosystem, which has hindered both practical development and production adoption of agents. To this end, we introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the end-to-end production pipeline for agent LLMs. ALE consists of three system components. ROLL is a post-training framework for weight optimization. ROCK is a sandbox environment manager that orchestrates environments for trajectory generation. iFlow CLI is an agent framework that enables configurable and efficient context engineering for environment interaction. We release ROME (ROME is Obviously an Agentic ModEl), an open-source agent grounded by ALE and trained on over one million trajectories. In addition, we curate a suite of data composition protocols that synthesize data ranging from isolated, static snippets to dynamic, complex agentic behaviors, with built-in verification of safety, security, and validity. We further develop an end-to-end training pipeline and propose a novel policy optimization algorithm, IPA, which assigns credit over semantic interaction chunks rather than individual tokens, improving training stability over long horizons. Empirical evaluations show that ROME achieves strong results across mainstream agentic benchmarks, including 24.72% on Terminal-Bench 2.0 and 57.40% accuracy on SWE-bench Verified, outperforming similarly sized models and rivaling those with over 100B parameters. To enable more rigorous evaluation, we introduce Terminal Bench Pro, a benchmark with improved scale, domain coverage, and contamination control. On this benchmark, ROME still demonstrates competitive performance among open-source models of similar scale, and it has been successfully deployed in production, demonstrating the practical effectiveness of ALE.

💡 Deep Analysis

Figure 1

📄 Full Content

Recent years have witnessed a transformative wave in software engineering driven by large language models (LLMs) (Hou et al., 2024). Early efforts largely cast LLMs as one-shot generators, emitting static responses to a single prompt (Jiang et al., 2025; Allamanis et al., 2018; Hou et al., 2024). Yet this paradigm provides limited iterative reasoning and lacks grounded feedback loops, rendering it ill-suited for complex, end-to-end workflows. Accordingly, the frontier of LLM-based workflow-driven tasks (e.g., software engineering) is shifting toward the agentic crafting paradigm, which enables LLMs to plan, execute, and self-correct through multi-turn interactions with environments, spanning software repositories, terminals, and broader tool- and language-mediated workflows in the real world (Ning et al., 2025; Ye et al., 2025; Wang et al., 2025e; Gao et al., 2023; Novikov et al., 2025).

However, the widespread practical adoption of agentic crafting remains elusive in the absence of a scalable, end-to-end agentic ecosystem. Prior work has sought to improve agentic crafting via supervised fine-tuning (SFT) on limited human demonstrations (Emergent Mind, 2025; Wang et al., 2025a), or through ad-hoc reinforcement learning (RL) recipes that often struggle with long-horizon tasks and sparse, delayed rewards (Luo et al., 2025; Tan et al., 2025; Wang et al., 2025a). In this report, we contend that a principled agentic ecosystem must close the loop spanning data generation, agent execution, and policy optimization, enabling a continuous end-to-end optimization workflow that can adapt to distribution shift and growing complexity in production environments. To bridge this gap, we present the Agentic Learning Ecosystem (ALE), a full-stack infrastructure that unifies data, training, and deployment for agentic intelligence. Concretely, ALE comprises three synergistic system components. ROLL (Reinforcement Learning Optimization for Large-Scale Learning) is a scalable RL training framework supporting multi-environment rollouts, chunk-aware credit assignment, and stable policy updates for long-horizon agentic tasks.

ROCK (Reinforcement Open Construction Kit) is a secure, sandboxed agent execution platform that provides executable, tool-grounded environments and supports interaction trajectory synthesis, execution, and validation. iFlow CLI is an agent framework that orchestrates structured prompt suites for environment interaction, coupled with a user-facing interface that packages agents for real-world workflows and exposes APIs for continuous refinement via user feedback.
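
To make the division of labor concrete, here is a minimal sketch of how the three components could compose into a single rollout loop: ROCK-style sandboxed execution, iFlow-style context engineering around the policy model, and ROLL-style policy updates over the collected trajectories. This is an illustrative assumption only; the class and method names (RockSandbox, IFlowAgent, RollTrainer, collect_trajectory) are hypothetical and do not reflect the actual ROLL, ROCK, or iFlow CLI APIs.

```python
# Hypothetical sketch of how ALE's three components might compose into one
# rollout-and-update loop. All class and method names are illustrative
# assumptions, not the actual ROLL / ROCK / iFlow CLI interfaces.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trajectory:
    """One multi-turn interaction: actions, observations, and a final reward."""
    steps: List[dict] = field(default_factory=list)
    reward: float = 0.0


class RockSandbox:
    """Stands in for ROCK: an isolated, tool-grounded execution environment."""
    def reset(self, task: dict) -> str:
        return task["initial_observation"]

    def execute(self, action: str) -> Tuple[str, bool]:
        # Run the action (e.g. a shell command) in isolation; return the
        # observation and whether the task's verifier considers it solved.
        return f"output of {action!r}", False


class IFlowAgent:
    """Stands in for iFlow CLI: context engineering around the policy model."""
    def __init__(self, policy):
        self.policy = policy  # any object exposing generate(prompt) -> str

    def act(self, observation: str, history: List[dict]) -> str:
        prompt = self.build_context(observation, history)
        return self.policy.generate(prompt)

    def build_context(self, observation: str, history: List[dict]) -> str:
        # Placeholder for structured prompt suites / context packing.
        return "\n".join(str(step) for step in history) + "\n" + observation


class RollTrainer:
    """Stands in for ROLL: turns collected trajectories into policy updates."""
    def update(self, policy, trajectories: List[Trajectory]) -> None:
        ...  # e.g. chunk-level policy optimization such as IPA


def collect_trajectory(agent: IFlowAgent, env: RockSandbox, task: dict,
                       max_turns: int = 32) -> Trajectory:
    """Roll out one task: the agent acts, the sandbox executes and verifies."""
    traj = Trajectory()
    obs = env.reset(task)
    for _ in range(max_turns):
        action = agent.act(obs, traj.steps)
        obs, done = env.execute(action)
        traj.steps.append({"action": action, "observation": obs})
        if done:
            traj.reward = 1.0
            break
    return traj
```

In this framing, the sandbox owns execution and verification, the agent framework owns prompt construction, and the trainer only ever sees finished trajectories, which mirrors the separation of concerns between ROCK, iFlow CLI, and ROLL described above.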

Grounded in ALE, we incubate ROME as an open-source agent LLM based on Qwen3-MoE, developed tightly within our established ecosystem. Along the road to ROME, we take two deliberate steps. First, we establish a curated, coherent data composition workflow that synthesizes multi-source, multilingual, tool-grounded trajectories. Benefiting from the strong sandbox isolation and fine-grained permission control of ROCK, we run rigorous security, safety, and validity verification to ensure the integrity and quality of the generated trajectories. Second, we leverage millions of high-quality trajectories to iteratively refine an efficient, stage-wise training pipeline spanning continuous pre-training, SFT, and RL. Enabled by the tight integration of our ecosystem, the end-to-end training pipeline remains high-throughput, resource-efficient, and user-friendly. To further stabilize RL training dynamics, we propose Interaction-Perceptive Agentic Policy Optimization (IPA), a novel algorithm that optimizes policies over semantic interaction chunks (Li et al., 2025). By shifting credit assignment from tokens to semantically meaningful chunks, IPA improves long-horizon stability and ultimately strengthens long-context agentic crafting performance.
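
To illustrate the chunk-level credit assignment idea in the simplest possible terms, the sketch below aggregates per-token log-probabilities over semantic interaction chunks before applying a single trajectory-level advantage, so that all tokens in one chunk share the same credit. This is a toy REINFORCE-style surrogate under assumed simplifications, not the published IPA objective; the chunking rule and the scalar advantage are placeholders.

```python
import torch


def chunked_policy_loss(logprobs: torch.Tensor,
                        chunk_ids: torch.Tensor,
                        advantage: float) -> torch.Tensor:
    """Toy illustration of chunk-level credit assignment (not the actual IPA loss).

    logprobs:  (T,) per-token log-probabilities of the sampled trajectory
    chunk_ids: (T,) index of the semantic interaction chunk each token belongs to
    advantage: a single trajectory-level advantage, e.g. reward minus a baseline
    """
    num_chunks = int(chunk_ids.max().item()) + 1
    # Sum log-probs within each chunk so one chunk contributes one term,
    # instead of every token carrying its own credit.
    chunk_logprobs = torch.zeros(num_chunks, dtype=logprobs.dtype)
    chunk_logprobs = chunk_logprobs.scatter_add(0, chunk_ids, logprobs)
    # REINFORCE-style surrogate at chunk granularity.
    return -(advantage * chunk_logprobs).mean()


# Example: 6 tokens split into 3 interaction chunks (plan / command / reflection).
lp = torch.tensor([-0.2, -0.5, -0.1, -0.3, -0.4, -0.6], requires_grad=True)
chunks = torch.tensor([0, 0, 1, 1, 2, 2])
loss = chunked_policy_loss(lp, chunks, advantage=0.8)
loss.backward()
```

The point of the illustration is the granularity: the gradient signal is attached to chunk-level sums of log-probabilities rather than to each token individually, which is what the report argues stabilizes optimization over long horizons.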

Extensive empirical results demonstrate that ROME achieves solid and consistent performance across a diverse set of agentic benchmarks. On terminal-centric tasks, ROME achieves 57.4% accuracy on SWE-bench Verified and 24.7% on Terminal-Bench 2.0, outperforming models of similar scale and approaching the performance of larger models exceeding 100B parameters. On the more rigorous Terminal Bench Pro, which enforces stricter contamination control and improved domain balance, ROME still performs competitively, showing strong generalization and stability across domains. Furthermore, ROME has been integrated into iFlow CLI and stably deployed in production. This real-world validation, together with ALE, establishes a robust, scalable, and production-grade foundation for the continual training and enhancement of ROME.

In summary, this technical report presents a reliable, cost-effective, secure, and user-friendly training ecosystem that enables practitioners to build customized models tailored to diverse needs. Beyond a technical stack, ALE is also a call to reframe the community’s priorities. In complex agentic settings, the central challenge


Reference

This content is AI-processed based on open access ArXiv data.
