Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, generated either manually or automatically. Although this practice is strongly encouraged by agent developers, there has been no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents’ task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.
💡 Research Summary
The paper presents a systematic empirical investigation of repository‑level context files such as AGENTS.md and their actual impact on the performance of autonomous coding agents. While the practice of adding such files has been widely promoted by LLM providers and agent frameworks, no prior work has quantified whether they help agents solve real‑world software engineering tasks. To fill this gap, the authors evaluate coding agents in two complementary settings.
First, they use the established SWE‑bench Lite benchmark, which consists of issues from popular open‑source Python repositories. For each repository they generate an AGENTS.md file automatically using the recommendations published by agent developers (the “LLM‑generated” condition).
Second, they construct a new benchmark called AGENT‑BENCH. This dataset is curated from twelve niche Python projects that already contain developer‑committed context files. From 5,694 pull requests they extract 138 high‑quality instances, each comprising a detailed task description, the repository state, a gold‑standard patch, and a test suite. Because many PRs lack unit tests, the authors employ LLM agents to synthesize appropriate tests, verify that they fail on the original code and pass on the gold patch, and manually prune over‑specified cases.
Four coding agents are evaluated, backed by Claude Sonnet‑4.5, GPT‑4.2, GPT‑4.1, and Mini‑Qwen‑3‑30B. All agents run on comparable harnesses that expose file‑system, shell, and test‑execution tools. Each agent is run under three conditions: (1) no context file, (2) an automatically generated AGENTS.md, and (3) the developer‑provided AGENTS.md. Success is defined as the agent’s predicted patch causing all tests to pass.
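The per-condition comparison then reduces to a success-rate computation over boolean outcomes. A minimal sketch (structure and names are illustrative, not taken from the paper):

```python
def success_rates(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Map each condition (e.g. no context, generated AGENTS.md,
    developer AGENTS.md) to the fraction of instances whose predicted
    patch made the full test suite pass."""
    return {cond: sum(runs) / len(runs) for cond, runs in outcomes.items()}
```

Since the same instances are run under every condition, differences between conditions can be reported directly in percentage points.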
Results show a consistent pattern across models and benchmarks. Without any context file, the average success rate is about 45%. Adding an LLM‑generated context file reduces success by roughly 3 percentage points. Providing the developer‑written file yields a modest 4‑percentage‑point increase relative to the no‑context baseline, but the gain is not statistically significant. In terms of computational cost, both types of context files increase token usage and wall‑clock time by more than 20%, reflecting longer inference and more extensive tool usage.
Trace analysis reveals that context files change agent behavior: agents explore more files, run larger test suites, and faithfully execute the build and test commands described in the file. This leads to broader testing and deeper reasoning, which explains the higher cost. However, the extra exploration does not translate into higher success; instead, the additional “requirements” in many AGENTS.md templates appear to distract the model from the core task.
The authors conclude that current AGENTS.md templates often contain unnecessary requirements that make tasks harder for LLM‑based agents. Minimal context—essentially the tooling commands needed to build and test the repository—offers the best trade‑off. Automatically generated files, as implemented today, tend to hurt performance and should be omitted until generation techniques improve. Human‑written files are only beneficial when they are concise and focused.
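In that spirit, a minimal human-written context file might contain little beyond the tooling commands. The file below is a hypothetical example consistent with the paper's recommendation; the specific commands are placeholders, not taken from any evaluated repository:

```markdown
# AGENTS.md

## Setup
- Install dev dependencies: `pip install -e ".[dev]"`

## Test
- Run the suite: `pytest tests/`
```

Anything beyond this, such as style mandates or extra review checklists, risks adding the kind of unnecessary requirements the study found to hurt task success.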
The paper contributes (1) the AGENT‑BENCH benchmark for studying context‑file effects on real‑world issues, (2) a thorough cross‑model evaluation showing that LLM‑generated context files generally degrade performance while developer‑written files provide marginal gains, and (3) a detailed behavioral analysis indicating that context files induce more thorough testing and file traversal at the cost of higher inference overhead. The findings suggest that both agent developers and repository maintainers should keep context files minimal and consider disabling automatic generation until more reliable prompting or fine‑tuning methods become available. Future work could extend the analysis to other programming languages, larger codebases, and alternative agent architectures.