BTGenBot-2: Efficient Behavior Tree Generation with Small Language Models
Recent advances in robot learning increasingly rely on LLM-based task planning, leveraging their ability to bridge natural language with executable actions. While prior works have demonstrated strong performance, the widespread adoption of these models in robotics has been challenging because (1) existing methods are often closed-source or computationally intensive, neglecting actual deployment on real-world physical systems, and (2) there is no universally accepted, plug-and-play representation for robotic task generation. Addressing these challenges, we propose BTGenBot-2, a 1B-parameter open-source small language model that directly converts natural language task descriptions and a list of robot action primitives into executable behavior trees in XML. Unlike prior approaches, BTGenBot-2 enables zero-shot BT generation and error recovery at both inference and runtime, while remaining lightweight enough for resource-constrained robots. We further introduce the first standardized benchmark for LLM-based BT generation, covering 52 navigation and manipulation tasks in NVIDIA Isaac Sim. Extensive evaluations demonstrate that BTGenBot-2 consistently outperforms GPT-5, Claude Opus 4.1, and larger open-source models across both functional and non-functional metrics, achieving average success rates of 90.38% in zero-shot and 98.07% in one-shot, while delivering up to 16x faster inference compared to the previous BTGenBot.
💡 Research Summary
The paper introduces BTGenBot‑2, a 1‑billion‑parameter open‑source small language model (SLM) designed to generate executable behavior trees (BTs) directly from natural‑language task descriptions and a list of robot action primitives. The authors identify two major obstacles that have limited the adoption of large language models (LLMs) in robotics: (1) most prior approaches rely on closed‑source or computationally heavy models that cannot run on resource‑constrained robots, and (2) there is no standardized, plug‑and‑play representation for robot task generation. To address these gaps, BTGenBot‑2 is built on Llama‑3.2‑1B‑Instruct and fine‑tuned using QLoRA, a parameter‑efficient quantized LoRA technique that keeps the base model frozen while training only a small adapter.
Data creation starts from the publicly available TSE dataset (≈600 real‑world BTs). Each BT is expanded into three variants using GPT‑4o‑mini, and the resulting set is expanded again, yielding a total of 5,204 (instruction, input, output) triples. The "instruction" component provides context and constraints, the "input" contains the natural‑language task description plus the allowed action primitives, and the "output" is an XML‑formatted BT compatible with the ROS2 BehaviorTree.CPP library. The dataset follows the Alpaca schema and is carefully validated for syntactic correctness and action‑space consistency.
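To make the Alpaca schema concrete, a training triple might look as follows. The exact wording, attribute names, and the action primitives (`MoveTo`, `Pick`) are illustrative assumptions, not reproduced from the released dataset.

```python
# Illustrative Alpaca-style (instruction, input, output) triple.
# Field contents are hypothetical; only the three-field schema and the
# XML/BehaviorTree.CPP output format are stated in the paper.
sample = {
    "instruction": (
        "Generate a behavior tree in XML compatible with the "
        "BehaviorTree.CPP library. Use only the listed action nodes."
    ),
    "input": (
        "Task: fetch the cup from the table. "
        "Available actions: MoveTo, Pick."
    ),
    "output": (
        "<root BTCPP_format='4'>"
        "<BehaviorTree ID='MainTree'><Sequence>"
        "<MoveTo target='table'/><Pick object='cup'/>"
        "</Sequence></BehaviorTree></root>"
    ),
}
```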
Fine‑tuning is performed on two RTX 6000 GPUs (48 GB VRAM each) for about 30 hours, using a 95/5 train‑test split, a batch size of 16, a fixed learning rate of 1e‑4, and up to five epochs. The authors expand LoRA target modules beyond the usual attention projections to include MLP layers, achieving training accuracies above 95 % despite the relatively small token count (≈3.5 M tokens).
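The hyperparameters above can be collected into a configuration sketch. The learning rate, batch size, epoch count, split, and the choice to target MLP layers alongside attention projections come from the paper; the HuggingFace-style module names and the LoRA rank/alpha/dropout values are assumptions for illustration.

```python
# QLoRA fine-tuning configuration sketch. Values marked "paper" are
# reported in the summary; the rest (rank, alpha, dropout, module names)
# are illustrative assumptions in HuggingFace PEFT naming conventions.
qlora_config = {
    "base_model": "meta-llama/Llama-3.2-1B-Instruct",
    "learning_rate": 1e-4,          # paper: fixed LR
    "batch_size": 16,               # paper
    "epochs": 5,                    # paper: up to five epochs
    "train_test_split": (0.95, 0.05),  # paper: 95/5 split
    "lora_r": 16,                   # assumed
    "lora_alpha": 32,               # assumed
    "lora_dropout": 0.05,           # assumed
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP layers (expanded targets)
    ],
}
```

A dictionary like this could be unpacked into `peft.LoraConfig(**...)` in a PEFT-based training script; the key point from the paper is extending `target_modules` beyond the attention projections.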
A two‑stage error handling strategy is a core contribution. During inference, a validator parses the generated XML, checks it against a YAML‑defined whitelist of allowed BT nodes and robot primitives, and rejects malformed outputs, prompting regeneration. At runtime, an inline logger monitors the BT execution stack and blackboard state; if a node fails, the system automatically regenerates the offending subtree or substitutes a fallback decorator, enabling on‑the‑fly recovery without human intervention.
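The inference-time stage can be sketched as a small validator: parse the candidate XML and walk its nodes against a whitelist. The node and primitive names below are hypothetical stand-ins for what the paper loads from a YAML file; this is a minimal sketch of the check, not the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist; the real system loads allowed BT nodes and
# robot primitives from a YAML configuration file.
ALLOWED_NODES = {
    "root", "BehaviorTree", "Sequence", "Fallback",  # structural/control nodes
    "MoveTo", "Pick", "Place",                       # assumed action primitives
}

def validate_bt(xml_text: str) -> tuple[bool, str]:
    """Return (ok, message); reject malformed XML or unlisted nodes,
    so the caller can prompt the model to regenerate."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, f"malformed XML: {exc}"
    for node in root.iter():
        if node.tag not in ALLOWED_NODES:
            return False, f"disallowed node: {node.tag}"
    return True, "ok"
```

On rejection, the generation loop would re-prompt the model rather than hand the tree to the executor, which is what keeps syntactically or semantically invalid BTs from ever reaching the robot.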
To evaluate the approach, the authors construct the first standardized benchmark for LLM‑based BT generation, comprising 52 navigation and manipulation tasks in NVIDIA Isaac Sim, organized into three difficulty levels. Functional metrics (success rate, execution time) and non‑functional metrics (memory footprint, latency) are measured. BTGenBot‑2 achieves 90.38 % success in zero‑shot mode and 98.07 % in one‑shot mode, outperforming GPT‑5 (≈78 % zero‑shot) and Claude Opus 4.1 (≈73 %). Compared with the earlier 7 B BTGenBot, BTGenBot‑2 is up to 16× faster (average inference time reduced from ~0.6 s to ~0.04 s) and consumes less than 2 GB of RAM, making it suitable for embedded GPUs.
Real‑world validation is performed on ROS2‑based mobile manipulators and collaborative arms. The generated BTs execute correctly, and the runtime recovery mechanism successfully handles failures such as sensor noise or unexpected obstacles, confirming robustness beyond simulation.
The paper’s contributions are: (1) a publicly released synthetic instruction‑following dataset of 5,204 BT‑instruction pairs, (2) the open‑source BTGenBot‑2 model and ROS2 deployment code, (3) novel inference‑time and runtime error detection/recovery mechanisms, and (4) the first reproducible benchmark for LLM‑driven BT generation.
In discussion, the authors argue that small, efficiently fine‑tuned models can match or exceed the performance of much larger proprietary LLMs when the output format is tightly constrained and validated. They suggest future work on multimodal inputs (vision, force), online continual learning for domain adaptation, and extending compatibility to other robotics frameworks such as ROS 1 and MoveIt. BTGenBot‑2 demonstrates that high‑quality robot planning can be achieved with lightweight, open‑source models, paving the way for broader, on‑device deployment of LLM‑driven autonomy.