Artisan: Agentic Artifact Evaluation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive and hence done only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions: First, we frame the reproduction problem as a code generation task where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine-grained level. Second, we design automated judging mechanisms that guide the agent toward the expected results without revealing them and that prevent trivial solutions, such as simply copying checked-in results. To evaluate Artisan, we introduce Artisan-Bench, the first benchmark assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan-Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan-Bench for reproducibility to ensure that the tasks are feasible. Our experiments show that Artisan is effective, producing 44/60 reproduction scripts and outperforming the best available baseline, a vanilla LLM agent (mini-swe-agent), by 3.14$\times$ in terms of reproduction scripts generated, while taking $0.45 and 48 minutes on average per task. Artisan also helped uncover 20 new errors in either the paper or the artifact.


💡 Research Summary

Artifact evaluation has become a cornerstone of software engineering research, yet its current practice is heavily manual, time‑consuming, and produces only a one‑off “artifact‑evaluated” badge. The paper “Artisan: Agentic Artifact Evaluation” proposes a fundamentally different approach: treat the reproduction of a paper’s results as a code‑generation problem and let a large language model (LLM)‑based agent automatically produce a self‑contained reproduction script. The system, named Artisan, receives three inputs – the research paper, a table of numerical results, and the URL of the associated artifact – and outputs a Bash (or similar) script that, when run independently of the agent, recreates the exact table.

The workflow consists of five stages. First, the table’s numeric entries are obfuscated (replaced with “?”) to hide the expected values from the agent, preventing trivial copying. Second, the artifact is automatically downloaded from the supplied URL (e.g., Zenodo, GitHub). Third, the LLM agent, equipped with two tools—a general‑purpose Bash executor and a format‑conversion utility—interacts with the downloaded files: it reads README files, discovers relevant build or analysis commands, and iteratively executes them. Fourth, the agent writes a candidate reproduction script. Fifth, a two‑tier automated judging mechanism evaluates the script. The first tier checks whether the script’s output matches the obfuscated table after de‑obfuscation; the second tier verifies that the script does not simply copy pre‑computed results (the “CopyRepro” check). If either check fails, the agent receives feedback and continues the loop until a satisfactory script is produced or a step limit is reached.
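The obfuscation step and the first judging tier described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's actual implementation): numeric entries in a results table are replaced with "?", and a candidate output passes the first check only if it fits the obfuscated template with a number wherever a "?" appears.

```python
import re

def obfuscate(table: str) -> str:
    """Stage-one sketch: hide every numeric entry behind '?' so the
    agent never sees the expected values."""
    return re.sub(r"\d+(?:\.\d+)?", "?", table)

def matches_template(template: str, candidate: str) -> bool:
    """Tier-one judging sketch: the candidate output must match the
    obfuscated template, with a number wherever a '?' appears."""
    pattern = re.escape(template).replace(r"\?", r"\d+(?:\.\d+)?")
    return re.fullmatch(pattern, candidate) is not None

paper_table = "precision 0.91 | recall 0.87"
template = obfuscate(paper_table)  # "precision ? | recall ?"
print(matches_template(template, "precision 0.92 | recall 0.87"))  # True
print(matches_template(template, "precision high | recall low"))   # False
```

Note that this tier only verifies the output's shape; comparing the de-obfuscated values against the paper's numbers happens separately, so the agent receives structural feedback without learning the expected results.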

To evaluate Artisan, the authors introduce Artisan‑Bench, the first benchmark that measures LLM agents’ ability to generate reproduction scripts for software‑engineering papers. Artisan‑Bench comprises 60 tasks derived from 23 papers covering a wide range of sub‑areas (static analysis, smart contracts, testing, etc.) and programming languages (Java, Python, Solidity, etc.). All tasks were manually validated for reproducibility beforehand, ensuring that a correct script exists.

Experimental results show that Artisan generates successful reproduction scripts for 44 out of 60 tasks (73%). This outperforms the strongest baseline, a vanilla LLM agent called mini‑swe‑agent, by a factor of 3.14 in terms of scripts produced. The average wall‑clock time per task is 48 minutes, and the average cost is $0.45, indicating practical efficiency. Importantly, the automated judge's copy check (CopyRepro) successfully blocks scripts that merely copy checked‑in results, forcing the agent to synthesize genuine reproduction logic. Moreover, Artisan uncovered 20 previously unknown inconsistencies between papers and their artifacts (e.g., mismatched numbers, missing steps), demonstrating that automated evaluation can surface real research errors that manual reviews miss.
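One plausible way to realize a CopyRepro-style check is differential re-execution: rerun the candidate script in a copy of the artifact from which any checked-in file matching the final output has been removed, and flag the script if it can no longer produce the output. The sketch below is a hypothetical implementation under that assumption, not the paper's actual mechanism; the function name and strategy are illustrative only.

```python
import hashlib
import shutil
import subprocess
import tempfile
from pathlib import Path

def copies_checked_in_results(artifact_dir: str, script: str, output: str) -> bool:
    """CopyRepro-style sketch (hypothetical): delete every artifact file
    whose contents equal the expected output, rerun the script, and
    report True if the output can no longer be produced, which suggests
    the script was copying a pre-computed result instead of recomputing it."""
    digest = hashlib.sha256(output.encode()).hexdigest()
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "artifact"
        shutil.copytree(artifact_dir, work)
        for f in work.rglob("*"):
            if f.is_file() and hashlib.sha256(f.read_bytes()).hexdigest() == digest:
                f.unlink()  # drop checked-in copies of the result
        run = subprocess.run(["bash", script], cwd=work,
                             capture_output=True, text=True)
        return output not in run.stdout
```

For example, a script that merely runs `cat results.txt` would be flagged once `results.txt` is removed, while a script that recomputes the numbers from the artifact's source code would still pass.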

The paper also discusses limitations. Artisan currently handles only tabular numeric outputs; reproducing figures, images, or more complex visualizations remains an open problem. LLM hallucinations sometimes lead the agent to issue incorrect download commands or to misuse repository APIs, as illustrated by failed trajectories in the paper. While the feedback loop mitigates some failures, full automation still requires occasional human oversight. Future work is suggested in extending the benchmark to multimodal outputs, improving error‑recovery strategies, and fostering community contributions to broaden the benchmark’s coverage.

In summary, Artisan introduces a novel “code‑generation + automated judging” paradigm for artifact evaluation. By producing executable, independently verifiable scripts, it reduces manual effort, scales to many papers, provides fine‑grained reproducibility evidence, and can be run continuously (e.g., before paper submission or as part of CI pipelines). The work represents a significant step toward more reliable, scalable, and automated reproducibility assessment in software engineering research.

