Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.


💡 Research Summary

The paper tackles the practical shortcomings of generating code with large language models (LLMs) at runtime for software scripting automation. Runtime generation suffers from unverified code, security concerns, high latency, and elevated token costs, especially when serving large user bases. To overcome these issues, the authors propose an offline simulation framework that pre‑curates a software‑specific “skillset” – a repository of verified scripts – which can be retrieved instantly during user interaction.

The framework consists of two main components.

  1. Task Creation – The system must first decide which automation tasks are worth generating. Two complementary strategies are employed:

    • Top‑down functional guidance: High‑level functionalities (e.g., “align objects”, “draw shapes”) are extracted from publicly available scripting guides. For each functionality, an LLM is prompted to produce a set of natural‑language task descriptions. This yields a broad coverage of the software’s core capabilities.
    • Bottom‑up API synergy exploration: All APIs exposed by the target software (Adobe Illustrator in the experiments) are treated as nodes in a graph. Edges represent co‑occurrence of two APIs in an already verified script, defining a “synergistic API pair”. A Graph Convolutional Network (GCN) is trained as a link‑prediction model on this graph, learning both semantic embeddings of API documentation and structural patterns of API co‑use. After training, the model predicts the likelihood that any two APIs can work together, even if they have never been observed together. For each API, the LLM is then prompted with the API plus its top‑k predicted synergistic partners, encouraging the generation of tasks that involve under‑utilized or long‑tailed APIs.
  2. Skill Generation with Trials – For each generated task, an LLM produces an initial script in ExtendScript (Adobe’s JavaScript‑like scripting language). The script is executed in an isolated Illustrator instance, and execution feedback (error messages, console logs, and visual outcomes) is collected. A second LLM (or a large vision‑language model) acts as a validator, examining the code, the execution output, and the visual result, then providing structured feedback on correctness, style, and alignment with the task intent. The generating LLM receives this feedback in a structured prompt and refines the script. Up to three refinement trials are allowed; scripts that pass validation are added to the final skillset.
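
The bottom‑up synergy exploration above can be sketched as a single GCN layer whose embeddings score API pairs. This is a minimal, untrained illustration: the 4‑node toy graph, identity features (standing in for API‑documentation embeddings), and random weights (standing in for parameters learned by link‑prediction training) are all hypothetical, not taken from the paper.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: symmetrically normalized adjacency, linear map, ReLU."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight, 0.0)

def synergy_scores(adj, feats, weight):
    """Score every API pair by the dot product of their GCN embeddings."""
    h = gcn_layer(adj, feats, weight)
    return h @ h.T

# Toy graph: 4 APIs; an edge marks co-occurrence in an already verified script.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)
feats = np.eye(4)                                 # stand-in for API-doc embeddings
weight = np.random.default_rng(0).normal(size=(4, 8))
scores = synergy_scores(adj, feats, weight)
# For API 0, rank candidate partners (excluding itself) by predicted synergy.
partners = np.argsort(-scores[0, 1:]) + 1
```

In the actual pipeline the weights would be trained so that observed co‑occurrence edges score highly; the top‑k unseen pairs for each API then seed task prompts targeting under‑utilized APIs.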

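The generate–execute–validate cycle of step 2 reduces to a small control loop. In this sketch, `generate_script`, `run_in_sandbox`, and `validate` are hypothetical stand‑ins for the code‑generating LLM, the isolated Illustrator instance, and the validator model:

```python
from typing import Callable, Optional

def discover_skill(task: str,
                   generate_script: Callable[[str, str], str],
                   run_in_sandbox: Callable[[str], str],
                   validate: Callable[[str, str, str], tuple],
                   max_trials: int = 3) -> Optional[str]:
    """Iteratively generate, execute, and refine a script; return it once verified."""
    feedback = ""                                      # no feedback on the first attempt
    for _ in range(max_trials):
        script = generate_script(task, feedback)       # LLM call (stub)
        exec_log = run_in_sandbox(script)              # isolated execution (stub)
        ok, feedback = validate(task, script, exec_log)  # validator model (stub)
        if ok:
            return script                              # verified skill joins the skillset
    return None                                        # discard after max_trials failures
```

Scripts that never pass validation within the trial budget are simply dropped, so only verified skills reach the runtime skillset.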
The authors evaluate the approach on Adobe Illustrator, a vector‑graphics application exposing 1,818 API endpoints (378 methods and 1,440 attributes). They compare their offline‑simulation pipeline against a baseline that generates code on the fly for each user query. Key findings include:

  • Higher automation success rate – The curated skillset achieves a 23‑percentage‑point increase in successful task completion, largely due to the inclusion of scripts that leverage previously under‑used APIs discovered by the GCN.
  • Reduced latency – Because runtime queries only require a lookup of an already verified script, average response time drops by a factor of 1.8 compared with on‑the‑fly generation.
  • Lower token cost – Pre‑computing scripts offline eliminates the need for multiple LLM calls at runtime, cutting average token consumption by roughly 35%.
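
The latency and cost gains follow because serving a query reduces to retrieving the closest verified skill rather than generating code. A minimal cosine‑similarity lookup (the embeddings, ids, and threshold are illustrative; any encoder over task descriptions would do):

```python
import numpy as np

def retrieve_skill(query_vec, skill_vecs, skill_ids, min_sim=0.5):
    """Return the id of the most similar verified skill, or None if nothing is close enough."""
    q = query_vec / np.linalg.norm(query_vec)
    s = skill_vecs / np.linalg.norm(skill_vecs, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity of query vs. every skill
    best = int(np.argmax(sims))
    return skill_ids[best] if sims[best] >= min_sim else None
```

Falling back to on‑the‑fly generation only when no skill clears the threshold would preserve coverage while keeping the common case cheap.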

The paper’s contributions are threefold: (1) introducing an offline simulation pipeline that separates skill discovery from user interaction, thereby mitigating security and cost concerns; (2) leveraging a GNN‑based link‑prediction model to systematically explore API synergy and expand coverage to long‑tailed APIs; (3) demonstrating that iterative refinement using execution feedback and a validator LLM yields high‑quality, verified scripts.

Limitations are acknowledged. The current study focuses solely on Illustrator; extending to other domains (e.g., Photoshop, Microsoft Office) may require additional seed scripts to train the API‑synergy graph. The validator’s reliability also depends on the underlying LLM, which could introduce false positives or negatives. Future work includes scaling to multiple software platforms, combining formal unit‑test suites with visual validation into a hybrid verifier, and continuously updating the skillset by using the execution logs collected during offline simulation as fine‑tuning data for the LLM.

In summary, the work presents a compelling blueprint for building cost‑effective, secure, and high‑performance automation assistants for complex software environments by marrying LLM code generation with offline execution feedback and graph‑based API reasoning.

