Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents
Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, no comprehensive survey has systematically analyzed the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of approaches to complex instruction understanding, bridging gaps between research directions and highlighting future research opportunities.
💡 Research Summary
The paper presents a comprehensive survey of research on complex, multi‑step instructional text and its role in building capable AI agents. While large language models (LLMs) have shown impressive abilities to follow simple, single‑step instructions through instruction tuning, they still struggle with real‑world tasks that involve temporal, conditional, and hierarchical dependencies. To map the current landscape, the authors performed a systematic, PRISMA‑guided review of the literature from 2010 to 2026 across major databases (DBLP, IEEE Xplore, Google Scholar, Semantic Scholar). After de‑duplication, screening, and eligibility checks, they identified 181 relevant papers.
The survey is organized around three research questions. RQ1 investigates how instructional text is represented across disciplines. The authors categorize representations into four families: (1) unstructured raw text, (2) event‑centric schemas that capture events, triggers, arguments, and temporal relations, (3) entity‑centric formats that track objects and their state changes, and (4) symbolic structures such as graphs, workflows, and business process models. For each family they list publicly available corpora (e.g., wikiHow, DeScript, ALFRED, CALVIN) and highlight the typical preprocessing or annotation pipelines.
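To make the second family concrete, here is a minimal sketch of what an event‑centric representation of a two‑step instruction might look like. The dataclasses, field names, and example sentence below are illustrative assumptions for this summary, not a schema taken from any specific corpus surveyed:

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    """A participant in an event, e.g. the object acted on or the tool used."""
    role: str   # e.g. "patient", "instrument", "location" (illustrative roles)
    text: str   # surface span from the instruction

@dataclass
class Event:
    """One procedure step, anchored by a trigger verb."""
    trigger: str                                # e.g. "chop"
    arguments: list = field(default_factory=list)
    after: list = field(default_factory=list)   # indices of preceding events
                                                # (temporal relation)

# "Chop the onions with a knife, then fry them in the pan."
chop = Event("chop", [Argument("patient", "the onions"),
                      Argument("instrument", "a knife")])
fry = Event("fry", [Argument("patient", "them"),
                    Argument("location", "the pan")], after=[0])

procedure = [chop, fry]
```

Note how the `after` field carries the temporal ordering and the `patient` of the second event ("them") still requires coreference resolution back to "the onions", the kind of phenomenon event‑centric annotation pipelines have to handle.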
RQ2 maps the downstream tasks that consume these representations. Tasks are split into grounded and ungrounded categories. Grounded tasks involve interaction with an external environment—dialogue agents (ABCD, TEACh), web agents (WebArena, Mind2Web), navigation agents (ALFRED, VirtualHome), GUI agents (Mobile‑Env), robotic agents (CALVIN, Tellex), and game agents (SmartPlay, Minecraft). These tasks require multimodal perception, tool use (e.g., Python interpreters), and planning over long horizons. Ungrounded tasks operate purely on text and include summarization, event alignment, implicit instruction detection/correction, entity tracking, parsing of process structures, question answering, reading comprehension, and knowledge acquisition. The authors provide a detailed table of datasets, evaluation metrics, and dominant modeling approaches (Seq2Seq, Transformers, Graph Neural Networks, reinforcement learning with human feedback).
RQ3 identifies recurring challenges that persist despite methodological advances. First, ambiguity and coreference within instructions lead to misinterpretation. Second, modeling long‑range dependencies and conditional branches remains brittle, causing steep performance drops on multi‑step reasoning. Third, the heterogeneity of representation schemes hampers cross‑domain transfer and limits the reuse of resources. Fourth, existing benchmarks often simplify the environment or restrict the instruction length, failing to reflect real‑world complexity. To address these gaps, the authors propose (i) expanding multimodal, multi‑step datasets that capture realistic procedural scenarios, (ii) developing graph‑based models that explicitly encode event‑entity relations and enable compositional reasoning, (iii) integrating chain‑of‑thought prompting with reinforcement learning from human feedback to improve step‑wise planning, and (iv) establishing standardized, environment‑agnostic evaluation protocols that measure both correctness and efficiency of instruction execution.
The paper also situates its contribution relative to adjacent surveys on event extraction, semantic parsing, grounding, and program synthesis, emphasizing that none of these works provide a unified view of procedural text across NLP, robotics, business intelligence, and computer vision. By offering a taxonomy of data representations, a taxonomy of tasks, and a roadmap of open challenges, the survey serves as a reference point for researchers aiming to build AI agents that can be programmed via natural language to perform complex, real‑world tasks. The authors conclude that a concerted effort to harmonize datasets, representations, and evaluation will be essential for the next generation of instruction‑following agents.