Failing to Explore: Language Models on Interactive Tasks

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore–exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.


💡 Research Summary

This paper investigates how well large language models (LLMs) can explore interactive environments when they are constrained by a fixed interaction budget. To make the problem concrete and measurable, the authors introduce three parametric benchmark tasks—HillSearch, TreeSearch, and MaxSatSearch—each designed to span continuous, graph‑structured, and combinatorial domains respectively, while embedding “traps” that lure an agent toward sub‑optimal solutions early on.

In HillSearch, a hidden smooth function composed of many Gaussian hills contains a single narrow, high peak (the "needle"). The model can query the function value at any point in the domain, but under a limited budget the broad hills act as traps: greedy search settles on them, while finding the needle requires deliberate exploration.
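A minimal sketch of such an environment may help make the setup concrete. The code below is an illustrative reconstruction, not the authors' implementation: the hill counts, widths, the unit-interval domain, and the `BudgetedOracle` wrapper are all assumptions chosen to mirror the description (broad Gaussian hills, one narrow high needle, a hard cap on queries).

```python
import numpy as np

def make_hillsearch(n_hills=10, needle_height=2.0, needle_width=0.01, seed=0):
    """Build a hypothetical HillSearch-style objective on [0, 1]:
    several broad Gaussian hills plus one narrow, high 'needle' peak.
    All parameter choices here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 1.0, n_hills)   # broad-hill centers
    heights = rng.uniform(0.2, 1.0, n_hills)   # broad-hill heights (< needle)
    widths = rng.uniform(0.05, 0.2, n_hills)   # broad hills are wide
    needle_center = rng.uniform(0.0, 1.0)      # hidden global optimum

    def f(x):
        # Sum of broad Gaussian hills (the "traps") ...
        broad = sum(h * np.exp(-((x - c) / w) ** 2)
                    for c, h, w in zip(centers, heights, widths))
        # ... plus the narrow, high needle.
        needle = needle_height * np.exp(-((x - needle_center) / needle_width) ** 2)
        return broad + needle

    return f, needle_center

class BudgetedOracle:
    """Wraps the hidden function and enforces the interaction budget:
    each call to query() consumes one unit; exceeding it raises."""
    def __init__(self, f, budget):
        self.f, self.budget, self.queries = f, budget, 0

    def query(self, x):
        if self.queries >= self.budget:
            raise RuntimeError("interaction budget exhausted")
        self.queries += 1
        return self.f(x)
```

An agent (LLM or heuristic) would then interact only through `oracle.query(x)`, so the number of allowed observations is exactly the budget, matching the fixed-budget setting the paper evaluates.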

