Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features, a 16-percentage-point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
💡 Research Summary
This paper presents a large‑scale, task‑stratified empirical comparison of five widely used AI coding agents—OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code—using the AIDev dataset. After rigorous filtering (only closed PRs with at least one external review from repositories with permissive licenses), the authors retain 7,156 pull requests (PRs) spanning 87 active weeks. The study is organized around three research questions (RQs).
RQ1 – Temporal Evolution:
Weekly acceptance rates (the proportion of merged PRs) are modeled with simple linear regression (y = β₀ + β₁·t + ε) and LOESS smoothing. Devin is the sole agent showing a statistically meaningful upward trend: β₁ = +0.77 percentage points per week, R² = 0.34, rising from ~60 % to ~80 % over a 32‑week window. All other agents display flat trajectories (β₁≈0) with modest R² values, indicating stable performance despite possible model updates or user‑behavior changes.
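The weekly-trend fit can be sketched in a few lines. The series below is hypothetical, constructed as a clean linear trend matching Devin's reported slope (real weekly data would be noisy, yielding an R² below 1, e.g. the reported 0.34); scipy's `linregress` stands in for the paper's ordinary-least-squares fit:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical weekly acceptance rates (in percent) over a 32-week window,
# built as a noiseless linear trend with the reported slope of +0.77 pp/week.
weeks = np.arange(32)
acceptance = 60.0 + 0.77 * weeks

# Simple linear regression: y = b0 + b1 * t
fit = linregress(weeks, acceptance)
print(f"slope = {fit.slope:+.2f} pp/week, R^2 = {fit.rvalue ** 2:.2f}")
```

On real data, LOESS smoothing would be layered on top to visualize local deviations from the linear fit.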
RQ2 – Performance Drivers:
The authors compute mean acceptance rates (MAR) for nine task categories (chore, docs, ci, build, refactor, feat, fix, test, perf). The gap between the highest (chore 84.0 %) and lowest (perf 55.4 %) is 29 pp, far exceeding inter‑agent differences within any single task. Documentation tasks achieve 82.1 % acceptance, while new‑feature tasks attain only 66.1 %—a 16‑pp disparity. Review frequency also varies: Copilot PRs receive an average of 4.94 reviews per PR, whereas Codex PRs receive only 1.39, suggesting that higher scrutiny may be associated with lower acceptance, although causality cannot be inferred.
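Once each PR carries a task label, computing the mean acceptance rate per category reduces to a single groupby over a merged-or-not flag. The rows below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical PR records; the real study labels each PR with one of nine
# task categories (chore, docs, ci, build, refactor, feat, fix, test, perf).
prs = pd.DataFrame({
    "agent":  ["codex", "codex", "devin", "cursor", "devin", "codex"],
    "task":   ["fix",   "docs",  "fix",   "fix",    "feat",  "fix"],
    "merged": [True,    True,    False,   True,     True,    True],
})

# Mean acceptance rate (MAR) per task category: the share of merged PRs.
mar = prs.groupby("task")["merged"].mean().sort_values(ascending=False)
print(mar)
```

The same groupby keyed on `["agent", "task"]` yields the per-agent, per-task rates used in the stratified comparison.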
RQ3 – Task‑Stratified Agent Comparison:
To control for workload heterogeneity, the study conducts pairwise Pearson chi‑square tests (or Fisher exact tests when expected counts < 5) for each (agent, task) combination, applying a Bonferroni correction (α = 0.05/64 ≈ 0.00078). Out of 64 tests, six reach significance, all concerning fix or feat tasks. Notable findings include:
- Codex vs. Devin on fix tasks – φ = 0.39 (medium effect), p < 0.0007, Codex wins.
- Copilot vs. Devin on fix – φ = 0.20 (small effect), p < 0.0007, Copilot wins.
- Codex vs. Copilot on fix – φ = 0.19 (small effect), p < 0.0007, Codex wins.
- Codex vs. Devin on feat – φ = 0.11 (small effect), p < 0.0007, Codex wins.
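The pairwise testing procedure can be sketched as follows; `compare_agents` is a hypothetical helper, and the merge counts are illustrative rather than taken from the dataset:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def compare_agents(merged_a, total_a, merged_b, total_b,
                   alpha=0.05, n_tests=64):
    """Pairwise 2x2 test of two agents' acceptance rates on one task,
    with a Bonferroni-corrected significance threshold (alpha / n_tests)."""
    table = np.array([[merged_a, total_a - merged_a],
                      [merged_b, total_b - merged_b]])
    chi2, p, _, expected = chi2_contingency(table, correction=False)
    if (expected < 5).any():
        # Fall back to Fisher's exact test when expected counts are sparse
        _, p = fisher_exact(table)
    phi = np.sqrt(chi2 / table.sum())  # phi effect size for a 2x2 table
    return p, phi, p < alpha / n_tests

# Illustrative counts: agent A merges 166/200 fix PRs, agent B 120/200.
p, phi, significant = compare_agents(166, 200, 120, 200)
print(f"p = {p:.1e}, phi = {phi:.2f}, significant = {significant}")
```

Following the usual convention, φ ≈ 0.1 is read as a small effect and φ ≈ 0.3 as a medium one, which matches how the paper labels its reported effect sizes.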
Overall, OpenAI Codex displays consistently high acceptance across all nine tasks (59.6 %–88.6 %), leading in fix (83.0 %) and refactor (74.3 %). Claude Code excels in documentation (92.3 %) and features (72.6 %) despite a limited sample size (139 PRs). Cursor shows the strongest performance on fix tasks (80.4 %) and test tasks (77.8 %).
Sensitivity Analysis:
Because agents were observed over different calendar windows, the authors repeat the entire analysis using the 11‑week interval common to all agents (May 19 – July 30, 2025). The pattern persists: Codex retains the highest overall acceptance (79.9 %), followed by Cursor (74.4 %). This robustness check confirms that the main conclusions are not artefacts of unequal observation periods.
Discussion & Implications:
The paper highlights three key take‑aways: (1) Temporal dynamics are heterogeneous—only Devin improves over time; (2) Task type is the dominant factor influencing PR acceptance, dwarfing inter‑agent variance; (3) No single agent dominates across all tasks, suggesting that tool selection should be task‑aware. The authors caution that acceptance rate alone may mask long‑term quality issues (e.g., static‑analysis warnings, technical debt) and advocate for multi‑dimensional evaluation frameworks in future work.
Conclusion:
By integrating temporal trend analysis, task‑level stratification, and rigorous statistical testing, the study provides a nuanced portrait of AI coding agents in real‑world development settings. It demonstrates that while Codex offers the most stable overall performance, specialized agents such as Claude Code and Cursor can outperform it on specific task categories. The findings inform practitioners choosing coding assistants, tool developers prioritizing feature improvements, and researchers designing more granular, longitudinal evaluation methodologies for AI‑assisted software engineering.