A Task-Level Evaluation of AI Agents in Open-Source Projects

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate agent performance along three task-aware dimensions spanning the PR lifecycle: (1) PR acceptance rate, (2) review discussion volume, and (3) commit message quality. Our quantitative analysis finds that Codex consistently achieves high PR acceptance rates across most task categories, while Copilot's PRs trigger the highest volume of both human and automated review discussions. In contrast, commit-level quality varies independently of acceptance outcomes: Claude and Cursor produce higher proportions of high-quality commit messages across several task types, while Codex exhibits comparatively lower commit quality despite strong integration outcomes. Our findings inform the selection and improvement of AI agents for effective integration into collaborative software engineering.


💡 Research Summary

This paper presents a large‑scale, task‑level empirical comparison of five autonomous coding agents—OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude—using the publicly released AIDev‑pop dataset. The dataset comprises 33,549 pull requests (PRs) authored by these agents across 2,807 popular GitHub repositories (each with at least 100 stars) up to August 2025. Each PR is labeled with one of eleven task types (feature addition, bug fix, documentation, build, continuous integration, chore, performance, refactor, style, test, and other), enabling fine‑grained analysis of agent performance across different software‑engineering activities.

The authors define three quantitative metrics that span the PR lifecycle: (1) PR acceptance rate (merged / submitted), measured per agent‑task pair; (2) review discussion volume, captured as the average number of comments per PR, split into human‑generated and bot‑generated comments; and (3) commit‑message quality, assessed with the C‑Good classifier (a BERT‑based BiLSTM model that flags a commit as high quality only when it contains both a “what” description and a “why” rationale). The classifier has an 81.6 % precision based on prior validation.
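The three metrics can be sketched as simple aggregations over PR-level records. The snippet below is a minimal illustration assuming a pandas DataFrame with hypothetical column names (`agent`, `task`, `merged`, `human_comments`, `bot_comments`, `good_commit`); these do not reflect the actual AIDev-pop schema, and the `good_commit` flag stands in for the C-Good classifier's output.

```python
import pandas as pd

# Hypothetical PR-level records; column names are illustrative,
# not taken from the actual AIDev-pop schema.
prs = pd.DataFrame({
    "agent":          ["Codex", "Codex", "Copilot", "Claude"],
    "task":           ["bug fix", "feature", "bug fix", "docs"],
    "merged":         [True, True, False, True],
    "human_comments": [0, 1, 3, 0],
    "bot_comments":   [0, 0, 2, 1],
    "good_commit":    [0, 1, 1, 1],  # 1 = message has both "what" and "why"
})

# (1) PR acceptance rate per agent-task pair: merged / submitted
acceptance = prs.groupby(["agent", "task"])["merged"].mean()

# (2) Review discussion volume: mean comments per PR, human vs. bot
discussion = prs.groupby("agent")[["human_comments", "bot_comments"]].mean()

# (3) Commit-message quality: share of PRs flagged as "good" commits
quality = prs.groupby("agent")["good_commit"].mean()

print(acceptance, discussion, quality, sep="\n\n")
```

On the real dataset, each aggregate would be computed over thousands of PRs per agent-task pair rather than the toy rows shown here.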

Key findings:

  • PR Acceptance (RQ1) – Codex achieves the highest overall acceptance rate (0.83) with the lowest standard deviation (0.06), indicating stable performance across tasks. It is the only agent with acceptance >0.80 for both feature additions and bug fixes. Copilot records the lowest overall rate (0.45). Claude and Cursor sit in the mid‑range (≈0.66–0.67). Maintenance‑related tasks (docs, build, CI) generally enjoy higher acceptance than functional tasks. A Mann‑Whitney‑Wilcoxon test confirms the difference between Codex and the next best agent (Cursor) is highly significant (p ≈ 4.27 × 10⁻⁶³).
  • Review Discussion Volume (RQ2) – Copilot generates the most discussion, averaging 1.25 bot comments and 1.31 human comments per PR, far exceeding the other agents. Codex receives virtually no comments (average 0.02 bot, 0.05 human); 98.2 % of its PRs have zero recorded comments. Overall, 90.6 % of PRs in the dataset have no comments, highlighting that many contributions are merged or closed without explicit review. The high comment volume for Copilot correlates with its low acceptance rate, suggesting that more feedback does not necessarily translate into integration success. Statistical testing again shows a significant gap between Copilot and the closest peer (Devin) (p ≈ 7.09 × 10⁻⁵⁰).
  • Commit‑Message Quality (RQ3) – Claude leads with an average “good commit” rate of 0.68, followed by Cursor (0.63) and Devin (0.57). Codex lags substantially at 0.32, despite its strong acceptance performance. Claude’s quality varies more across tasks (SD = 0.19) whereas Devin is the most consistent (SD = 0.07). Documentation, testing, and performance tasks tend to have higher good‑commit rates across agents, while feature additions and bug fixes are weaker. A pairwise Mann‑Whitney‑Wilcoxon test confirms Claude’s superiority over Cursor (p ≈ 4.55 × 10⁻³¹).
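The pairwise significance tests reported above follow the standard Mann-Whitney-Wilcoxon procedure, which compares two samples without assuming normality. A minimal sketch using SciPy, with made-up per-PR merge outcomes (1 = merged) standing in for the paper's real distributions:

```python
from scipy.stats import mannwhitneyu

# Illustrative per-PR merge outcomes for two agents; the study
# runs this test on the full AIDev-pop distributions, not toy data.
codex_outcomes  = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
cursor_outcomes = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

stat, p = mannwhitneyu(codex_outcomes, cursor_outcomes,
                       alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```

With the paper's sample sizes (thousands of PRs per agent), even modest rate differences yield the extremely small p-values reported, e.g. p ≈ 4.27 × 10⁻⁶³ for Codex vs. Cursor.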

The authors interpret these results as evidence that different agents excel on different dimensions: Codex is best for getting code merged with minimal human friction; Copilot provokes the most reviewer interaction, which may be useful for quality assurance but comes at the cost of lower acceptance; Claude and Cursor produce more informative commit messages, which can aid downstream maintenance. The paper also notes that the prevalence of comment‑free PRs—especially for Codex—means that many integration decisions are made without recorded discussion, a limitation of the dataset that warrants further qualitative investigation.

Methodologically, the study contributes a reproducible evaluation pipeline (public replication package) and demonstrates the utility of task‑aware metrics for comparing autonomous agents. The findings inform practitioners about which agent to adopt based on project priorities (e.g., rapid integration vs. thorough review vs. documentation quality) and guide future research on improving agent‑generated commit messages and fostering more transparent review processes.

