Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests
The growing adoption of AI coding agents has increased the number of agent-generated pull requests (PRs) merged with little or no human intervention. Although such PRs promise productivity gains, their post-merge code quality remains underexplored, as prior work has largely relied on benchmarks and controlled tasks rather than large-scale post-merge analyses. To address this gap, we analyze 1,210 merged agent-generated bug-fix PRs from Python repositories in the AIDev dataset. Using SonarQube, we perform a differential analysis between base and merged commits to identify code quality issues newly introduced by PR changes. We examine issue frequency, density, severity, and rule-level prevalence across five agents. Our results show that apparent differences in raw issue counts across agents largely disappear after normalizing by code churn, indicating that higher issue counts are primarily driven by larger PRs. Across all agents, code smells dominate, particularly at critical and major severities, while bugs are less frequent but often severe. Overall, our findings show that merge success does not reliably reflect post-merge code quality, highlighting the need for systematic quality checks for agent-generated bug-fix PRs.
💡 Research Summary
The paper investigates the post‑merge code quality of pull requests (PRs) automatically generated by AI coding agents. Using the AIDev dataset, the authors selected 1,210 merged bug‑fix PRs from 206 Python repositories. The PRs were authored by five agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude. For each PR, the authors performed a differential static analysis with SonarQube, comparing the base (pre‑merge) commit to the merged commit to identify newly introduced issues.
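The differential step can be sketched as a set difference over issue identifiers: any issue present in the merged-commit scan but absent from the base-commit scan counts as newly introduced by the PR. This is a minimal sketch under assumed inputs; the issue-key format below (`rule@file:line`) is an illustrative convention, not SonarQube's actual report schema.

```python
# Sketch of the differential analysis described above: issues found in the
# merged commit's scan but not in the base commit's scan are treated as
# newly introduced by the PR. Issue keys here are hypothetical identifiers.
def new_issues(base_scan: set[str], merged_scan: set[str]) -> set[str]:
    """Return issues present after the merge but not before it."""
    return merged_scan - base_scan

# Hypothetical scan results for one PR.
base = {"S1192@utils.py:10", "S3776@core.py:42"}
merged = {"S1192@utils.py:10", "S3776@core.py:42", "S1172@api.py:7"}
print(new_issues(base, merged))  # → {'S1172@api.py:7'}
```

Issues fixed by the PR (present in `base` but not in `merged`) would fall out of the same comparison in the other direction.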
The study measured issue counts, issue density (issues per thousand lines of code, KLOC), severity levels, and the specific SonarQube rules violated. Raw issue counts varied widely across agents (e.g., Codex introduced 456 issues, Claude only 69), but these differences largely reflected the disparate number of PRs contributed by each agent. After normalizing by code churn, the median issue density was similar for most agents, and a Kruskal‑Wallis test found no statistically significant differences. Cursor was a notable exception, showing a higher median density, indicating that its PRs tend to introduce more issues per unit of code change.
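The churn normalization the study relies on reduces to a simple ratio: newly introduced issues divided by thousands of changed lines. A minimal sketch, with illustrative numbers rather than the study's actual measurements:

```python
# Issue density as the paper defines it: issues per thousand lines of
# changed code (KLOC). Values below are illustrative, not from the dataset.
def issue_density(new_issues: int, lines_changed: int) -> float:
    """Return issues per KLOC; zero-churn PRs get density 0.0."""
    if lines_changed == 0:
        return 0.0
    return new_issues / (lines_changed / 1000)

# A large PR with more raw issues can still have the same density
# as a small PR with fewer issues.
print(issue_density(10, 4000))  # → 2.5
print(issue_density(3, 1200))   # → 2.5
```

This is why the raw counts (e.g., 456 vs. 69) are not directly comparable across agents with very different contribution volumes.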
Across all agents, Code Smells dominated the issue landscape, accounting for roughly 70% of newly introduced problems. These smells were frequently classified as Critical or Major, suggesting non‑trivial maintainability concerns. Bugs were less common (≈5% of issues) but a large proportion were labeled Blocker or Major, indicating that when functional defects do appear they can be severe. Security Hotspots were relatively rare, yet some High‑severity cases were observed, mainly involving encryption‑related operations and the use of publicly writable directories. No Vulnerabilities were reported.
The most frequently violated rules were:
- Bug rule python:S930 – incorrect number of function arguments (23 occurrences, primarily from Claude).
- Code Smell rule python:S1192 – duplicated string literals (212 occurrences).
- Code Smell rule python:S3776 – excessive cognitive complexity (157 occurrences).
- Code Smell rule python:S1172 – unused function parameters (114 occurrences).
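To make the two most frequent smell rules concrete, here is a hypothetical before-the-fix pattern and its remedy; none of this code comes from the studied PRs.

```python
# Illustrative fixes for two of the most-violated rules above (hypothetical code).

# python:S1192 fix: a repeated string literal is named once as a constant
# instead of being duplicated across the module.
TIMEOUT_MSG = "request timed out"

def retry_messages(attempts: int) -> list[str]:
    # python:S1172 fix: the signature declares only parameters the body uses
    # (an unused 'logger' parameter, for example, would trigger the rule).
    return [TIMEOUT_MSG for _ in range(attempts)]

print(retry_messages(2))  # → ['request timed out', 'request timed out']
```

Rule `python:S930` is the inverse failure mode: calling `retry_messages()` with no argument, or with two, would raise a `TypeError` at runtime, which is why SonarQube flags it as a Bug rather than a smell.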
Technical debt, expressed in estimated remediation hours, correlated strongly with PR size and agent contribution volume, reinforcing the observation that larger PRs naturally accrue more debt.
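A size-vs-debt association like the one reported is typically checked with a rank correlation, since remediation hours are unlikely to be normally distributed. A self-contained Spearman sketch on made-up data (the study's own statistics are not reproduced here):

```python
# Spearman rank correlation from scratch: ranks both series, then computes
# Pearson correlation on the ranks. Assumes no ties, which holds for the
# illustrative data below.
def spearman(xs: list[float], ys: list[float]) -> float:
    def ranks(v: list[float]) -> list[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical PR sizes (changed lines) vs. estimated remediation hours:
# a monotone relationship yields rho = 1.0.
print(spearman([120, 450, 900, 2100], [0.5, 2.0, 4.5, 11.0]))  # → 1.0
```

With tied values, average ranks would be needed; libraries such as SciPy handle that case.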
Key insights and recommendations:
- Merge success is not a proxy for quality – AI‑generated bug‑fix PRs can still introduce maintainability and functional defects after being merged.
- Normalize by code churn – Raw issue counts can be misleading; issue density provides a fairer comparison across agents.
- Integrate static analysis into CI/CD – Projects should adopt SonarQube or similar tools as quality gates, blocking merges that exceed predefined issue‑density thresholds or contain high‑severity bugs.
- Focus on maintainability – The prevalence of duplicated literals and high cognitive complexity suggests that AI agents often produce code that is harder to read and maintain. Enforcing maintainability checks (e.g., duplication detection, complexity limits) can mitigate long‑term technical debt.
- Agent‑specific monitoring – Cursor’s higher issue density indicates that certain agents may need tailored prompt engineering or post‑generation linting before submission.
In conclusion, the paper provides the first large‑scale empirical evidence that AI‑generated bug‑fix PRs, despite being merged automatically, can degrade code quality. It underscores the necessity of systematic, size‑aware quality assurance mechanisms beyond unit testing to ensure that productivity gains from coding agents do not come at the expense of software maintainability and reliability.