We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resulting models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide a theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications. First, for the field of AI safety: the reward function entailed by repeated deployment is not defined explicitly and could have unexpected consequences for the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
In this paper, we show that repeatedly deploying large language models (LLMs) and fine-tuning them on curated data from earlier deployments significantly improves their planning capabilities. The curation can be as simple as validating traces from previous generations and selecting the valid ones for future training. This mechanism produces a training process conceptually similar to RL fine-tuning, but with the reward signal left implicit. The core idea is simple: repeated deployment starts with users generating text with an LLM after its release. These texts go through a curation process, e.g. texts that do not capture user intent are rejected. The remaining texts are then shared on the web, and scrapes of the web that include the curated text are used to fine-tune the next generation of the LLM.
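To make the loop concrete, the sketch below lays out one deployment iteration in pseudocode form. The helper names (generate_trace, validate, finetune) are placeholders for the generation, curation, and training steps just described, not part of any released implementation.

```python
# Sketch of one deployment iteration. The helpers passed in (generate_trace,
# validate, finetune) are placeholders for the generation, curation, and
# training steps described in the text, not a released implementation.

def deployment_iteration(model, tasks, curated_pool, generate_trace, validate, finetune):
    """Produce generation n + 1 from generation n (``model``)."""
    for task in tasks:
        trace = generate_trace(model, task)      # users generate text with the deployed model
        if validate(task, trace):                # curation: traces that miss the intent are rejected
            curated_pool.append((task, trace))   # accepted traces are shared on the web
    next_model = finetune(model, curated_pool)   # a web scrape containing the curated text trains generation n + 1
    return next_model, curated_pool
```

The curated pool accumulates across iterations, so later generations are fine-tuned on traces collected from all previous deployments, not only the most recent one.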
Iterative deployment is not a contrived setting: GPT-3.5 was trained on data scraped from the web after GPT-3's deployment, and the data shared on the web at the time included curated texts generated by users with GPT-3 [2]. Similarly, GPT-4 was trained on data shared by users of GPT-3.5 and GPT-3 [17], and so on. With agent workflows becoming more commonplace, future training data will include agent traces from prior model generations, similarly leading to iterative training on previous-generation data. Figure 1 illustrates the basics of this mechanism.
We evaluate this mechanism in the well-controlled setting of classical planning. Iterative deployment captures a pattern common to planning: when users use LLMs, e.g. to review a product, help solve a reasoning task, or plan a trip, they are more inclined to share the LLM's results publicly if they are correct. This acts as a form of curation, where users select the correct ‘solutions’ before sharing them. Here, we want to study whether iteratively trained models can improve their planning capabilities with access only to their previously generated curated traces. We simulate the scenario just described in well-controlled environments of classical planning and focus on the self-improving capabilities of LLMs. First, we prompt a base model to solve a diverse set of planning tasks, mixing together both short-horizon and harder long-horizon planning problems. Then, we discard traces that do not solve the task, mix the remaining traces into the original data, and fine-tune the next-generation model. The iteratively trained models essentially bootstrap each other's planning capabilities: each model attempts to generate solutions to the planning tasks without relying on hand-designed prompts or on external planners. The simple plans solved in earlier generations become part of the training set of later generations, allowing the model to use these simple “building blocks” to solve more complicated planning problems.

Figure 1: Single iteration of the iterative deployment mechanism for planning. Using a fixed set of planning tasks, we prompt the current version of the LLM (referred to as generation n of the model) to solve these tasks. An external validator (e.g. a human using a chatbot, or a computer programme in the case of planning) identifies the tasks solved correctly. Their traces and plans, together with the traces and plans from tasks solved by previous generations, are then used to fine-tune generation n, producing generation n + 1 of the model.

Figure 2: Summary of our main results. Number of solved tasks (out of 1000 tasks per domain) for three different domains when comparing the base model with later deployed generations (generations 1, 2, and 5). Average over three separate runs. In all domains, the fifth generation more than doubles the performance of the base model.
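As an illustration of the role the external validator plays in the classical-planning setting, the following minimal checker accepts a plan only if every action's preconditions hold when it is applied and the goal facts hold at the end. The STRIPS-style Action representation is an assumption made for this sketch and stands in for whichever plan validator is actually used.

```python
# Minimal STRIPS-style plan checker, standing in for the external validator
# that decides which traces are kept. The Action representation below is an
# assumption made for this sketch, not the exact validator used in the paper.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action is applied
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false

def plan_is_valid(initial_state, goal, plan):
    """Accept a plan only if every action is applicable in turn and the
    goal facts hold in the final state; invalid plans are discarded."""
    state = set(initial_state)
    for action in plan:
        if not action.preconditions <= state:
            return False  # precondition violated: reject this trace
        state = (state - set(action.del_effects)) | set(action.add_effects)
    return set(goal) <= state  # goal reached: keep this trace for fine-tuning
```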
In our experiments using Qwen3 4B [19], within five (deployment) generations, the latest model generation achieves more than double the base model's performance in all tested domains. Figure 2 summarizes the results of our main experiment. In some cases, performance increases by a factor of 5. Later generations of the model can also find much longer plans than the base model, showing that this mechanism allows for out-of-distribution generalisation. Moreover, there is no significant difference in the average number of reasoning tokens produced by later generations, in contrast with some results of RL fine-tuning [5].
We formally prove that iterative deployment is equivalent to a special case of REINFORCE [31] in which the reward function is binary and traces are weighted according to importance sampling. This connection has two important implications. First, it highlights a significant safety risk in the deployment of iteratively trained models: when curation is done indirectly through user interactions after deployment, the next-generation model is effectively trained with an implicit reward function that can be difficult to control. This could have significant implications for the model's behaviour (e.g. the implicit reward could clash with safety training). Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
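The correspondence can be sketched as follows, with notation chosen for this illustration rather than taken from the formal proof: $\pi_\theta$ denotes the model's distribution over traces $\tau$, $R(\tau)\in\{0,1\}$ the validator's binary reward, and $\theta_n$ the weights of generation $n$.
\begin{align*}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]
     && \text{(REINFORCE)}\\
  &= \mathbb{E}_{\tau \sim \pi_{\theta_n}}\Big[ \tfrac{\pi_\theta(\tau)}{\pi_{\theta_n}(\tau)}\, R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \Big]
     && \text{(importance sampling from generation $n$)}.
\end{align*}
At $\theta = \theta_n$ the importance weight equals one, so the estimated gradient is proportional to $\sum_{\tau:\,R(\tau)=1} \nabla_\theta \log \pi_\theta(\tau)$, i.e. the gradient of the log-likelihood of the validated traces, which is exactly supervised fine-tuning on the curated pool.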