On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL
Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches an 82.9% valid plan rate on in-domain test sets, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, thus revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.
💡 Research Summary
This paper investigates whether fine‑tuned large language models (LLMs) truly acquire transferable planning competence or merely memorize domain‑specific patterns when solving PDDL planning tasks. The authors fine‑tune a 1.7B‑parameter Qwen‑3 model on 40,000 domain‑problem‑plan tuples drawn from ten International Planning Competition (IPC 2023) domains, using the Gideon data‑generation pipeline to ensure validated, diverse instances. They evaluate the model both on held‑out in‑domain test sets and on two completely unseen domains (Rover and Briefcase). In‑domain, the baseline supervised fine‑tuned model (B) reaches an 82.9% valid‑plan rate, but on the unseen domains it drops to 0%, revealing a severe generalization gap.
To diagnose the causes, three experimental variants are introduced:
- V1 – Instance‑wise Symbol Anonymization: All action, predicate, and object identifiers are replaced with random symbols (e.g., a0, p3, o7) on a per‑tuple basis, breaking any lexical semantics while preserving arity and relational structure. A curriculum gradually increases the proportion of anonymized examples within a single epoch. Results show a substantial performance decline, indicating that the model relies heavily on the semantic cues embedded in symbol names.
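The per-tuple anonymization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the id-drawing scheme, and the example PDDL snippets are assumptions; only the idea (fresh random symbols per tuple, arity and structure preserved) comes from the summary.

```python
import random
import re

def anonymize_tuple(domain_text, problem_text, plan_text,
                    actions, predicates, objects, seed=None):
    """Replace every action/predicate/object name with an opaque random
    symbol (a0, p3, o7, ...). The mapping is drawn fresh per tuple, so no
    lexical cue is shared across training examples; arity and relational
    structure are untouched."""
    rng = random.Random(seed)

    def fresh_map(names, prefix):
        # collision-free random ids, local to this tuple
        ids = rng.sample(range(10 * max(len(names), 1)), len(names))
        return {name: f"{prefix}{i}" for name, i in zip(names, ids)}

    mapping = {**fresh_map(actions, "a"),
               **fresh_map(predicates, "p"),
               **fresh_map(objects, "o")}

    # longest-first alternation with word boundaries, so names that are
    # prefixes of other names (e.g. "load" vs "load-truck") map correctly
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in
                            sorted(mapping, key=len, reverse=True)) + r")\b")

    def rewrite(text):
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return (rewrite(domain_text), rewrite(problem_text),
            rewrite(plan_text), mapping)
```

Because the mapping is regenerated for every tuple, a model cannot amortize symbol meanings across the dataset; it must solve each instance from structure alone, which is exactly the capability the stress test probes.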
- V2 – Compact Plan Serialization: The plan representation is stripped of timestamps, parentheses, and the terminating “END” token, leaving only the raw sequence of actions. This reduces token length without altering plan semantics. Training on this compact format yields a modest drop in in‑domain accuracy, confirming sensitivity to superficial formatting.
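A sketch of the compact serialization, assuming a typical timestamped plan line format (`<t>: (<action> <args...>)` followed by an `END` token); the paper's exact input format may differ.

```python
import re

def compact_serialize(plan_text: str) -> str:
    """Strip timestamps, parentheses, and the terminating END token,
    leaving only the bare action sequence."""
    actions = []
    for line in plan_text.splitlines():
        line = line.strip()
        if not line or line.upper() == "END":
            continue  # drop the terminator and blank lines
        line = re.sub(r"^\d+\s*:\s*", "", line)  # drop leading timestamp
        line = line.strip("()")                  # drop wrapping parentheses
        actions.append(line)
    return "\n".join(actions)
```

The transformation is semantics-preserving and easily invertible, which is what makes the observed accuracy drop diagnostic: any loss must come from format sensitivity, not from missing information.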
- V3 – Verifier‑Reward Fine‑Tuning (RL): Starting from the 1‑epoch checkpoint of V2, the authors apply reinforcement learning using Group Relative Policy Optimization (GRPO). For each problem, multiple candidate plans are sampled; each candidate is decoded back to standard PDDL syntax, validated with the VAL tool, and assigned a success‑focused reward based on functional correctness and detailed failure modes. The RL process saturates after roughly half the supervised epochs, providing a slight in‑domain boost but no improvement on unseen domains.
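The reward and advantage computation can be sketched as below. The `validate` callable is a placeholder standing in for a wrapper around the VAL tool, and the outcome labels and penalty values are illustrative assumptions; only the overall scheme (success-dominated reward with shaped failure modes, group-relative advantages over sibling samples) follows the summary.

```python
import statistics

def verifier_reward(candidate_plan, validate):
    """Success-focused reward: full credit only for a functionally valid
    plan, small shaped values for distinguishable failure modes.
    `validate` is a hypothetical VAL wrapper returning an outcome label."""
    outcome = validate(candidate_plan)
    reward_table = {
        "valid": 1.0,            # VAL accepts the plan
        "unsat_goal": 0.1,       # executes but misses the goal
        "bad_precondition": 0.05,  # some action inapplicable
        "parse_error": 0.0,      # not decodable to PDDL at all
    }
    return reward_table.get(outcome, 0.0)

def grpo_advantages(rewards):
    """GRPO-style advantage: each sampled plan is scored relative to the
    other samples drawn for the same problem (group mean/std baseline)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]
```

Because the group baseline is recomputed per problem, the signal only discriminates among sibling samples; once nearly all samples for in-domain problems validate, the advantages collapse toward zero, which is consistent with the early saturation the authors report.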
Across all variants, cross‑domain performance remains at 0%, while in‑domain performance plateaus around 80%. The diagnostic experiments collectively demonstrate that current LLM‑based planners are highly sensitive to surface‑level representations (symbol names, formatting) and do not learn abstract planning principles that generalize across domains. The paper contributes a multi‑domain benchmark, stress‑test tools (anonymization and compact serialization), and a verifier‑reward RL framework for future research. The authors release the dataset, code, and analysis scripts to facilitate reproducibility and encourage the community to develop methods, such as symbol‑invariant learning, meta‑planning, or tighter integration with external validators, that can bridge the observed generalization gap.