Are Coding Agents Generating Over-Mocked Tests? An Empirical Study
Coding agents have seen significant adoption in software development recently. Unlike traditional LLM-based code completion tools, coding agents work with autonomy (e.g., invoking external tools) and leave visible traces in software repositories, such as authoring commits. Among their tasks, coding agents may autonomously generate software tests; however, the quality of these tests remains uncertain. In particular, excessive use of mocking can make tests harder to understand and maintain. This paper presents the first study to investigate the presence of mocks in agent-generated tests of real-world software systems. We analyzed over 1.2 million commits made in 2025 in 2,168 TypeScript, JavaScript, and Python repositories, including 48,563 commits by coding agents, 169,361 commits that modify tests, and 44,900 commits that add mocks to tests. Overall, we find that coding agents are more likely than non-agents to modify tests and to add mocks to tests. We detect that (1) 60% of the repositories with agent activity also contain agent test activity; (2) 23% of commits made by coding agents add or change test files, compared with 13% by non-agents; (3) 68% of the repositories with agent test activity also contain agent mock activity; (4) 36% of commits made by coding agents add mocks to tests, compared with 26% by non-agents; and (5) recently created repositories contain a higher proportion of test and mock commits made by agents. Finally, we conclude by discussing implications for developers and researchers. We call attention to the fact that tests with mocks may be easier to generate automatically (but less effective at validating real interactions), and to the need to include guidance on mocking practices in agent configuration files.
💡 Research Summary
This paper presents the first large-scale empirical investigation of whether autonomous coding agents generate tests that over-use mocking, a practice that can make tests harder to understand, maintain, and validate. The authors focus on three widely adopted agents released in 2025 (Claude Code, GitHub Copilot Agent, and Cursor Agent) and examine their activity across 2,168 open-source repositories written in TypeScript, JavaScript, and Python. Using the SEART GitHub Search Engine, they first filter repositories to those with at least 100 commits, 5,000 non-blank lines of code, and recent activity, yielding an initial pool of 114,098 projects. They then identify repositories that contain agent-specific configuration files (e.g., CLAUDE.md, .copilot-*.md) and, crucially, detect actual agent-authored commits by searching commit metadata and message trailers for the strings "claude", "cursor", or "copilot", matched case-insensitively. Manual validation of 500 sampled commits confirms 100% precision, resulting in 48,563 agent commits spread across 1,219 repositories.
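The metadata-matching step described above can be sketched in a few lines of Python. Note that the function and field names below are illustrative assumptions, not the authors' actual implementation; the paper only specifies the three strings and the case-insensitive matching:

```python
import re

# Known coding-agent names, matched case-insensitively on word boundaries
# (so "precursor" does not falsely match "cursor").
AGENT_PATTERN = re.compile(r"\b(claude|cursor|copilot)\b", re.IGNORECASE)

def is_agent_commit(author: str, committer: str, message: str) -> bool:
    """Flag a commit as agent-authored if any metadata field or message
    trailer (e.g., a Co-Authored-By line) mentions a known agent."""
    return any(AGENT_PATTERN.search(field) for field in (author, committer, message))

print(is_agent_commit(
    "Jane Doe <jane@example.com>",
    "GitHub <noreply@github.com>",
    "Fix parser\n\nCo-Authored-By: Claude <noreply@anthropic.com>",
))  # → True
```

A word-boundary regex like this is a simple way to keep the 100% precision the authors report on their manual sample, since it avoids matching agent names embedded inside unrelated words.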
To locate test-related changes, the authors apply language-specific filename patterns (e.g., test_*.py, *_test.py, *.test.ts, *.spec.js) and also consider any file residing in a directory whose path contains "test" or "spec". This yields 169,361 test commits in 1,779 repositories. For mocking detection, they parse the source code changed in these test commits, extract all identifiers, and flag a commit as a "mock commit" if any identifier contains one of the terms dummy, stub, mock, spy, or fake (case-insensitive). This approach captures both framework-based mocks (e.g., Jest's jest.mock, Python's unittest.mock) and manually written test doubles. The analysis identifies 44,900 mock commits across 1,032 repositories.
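The two detectors can be approximated as follows; this is a minimal sketch under the assumptions stated in the summary (the exact pattern lists and helper names are mine, not the authors'):

```python
import re
from pathlib import PurePosixPath

# Illustrative filename patterns for the three studied languages.
TEST_FILE_PATTERNS = [
    r"test_.*\.py", r".*_test\.py",        # Python conventions
    r".*\.test\.(ts|tsx|js|jsx)",          # Jest-style suffix
    r".*\.spec\.(ts|tsx|js|jsx)",          # spec-style suffix
]
# Test-double terms from the paper, matched case-insensitively.
MOCK_TERMS = re.compile(r"dummy|stub|mock|spy|fake", re.IGNORECASE)
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_test_file(path: str) -> bool:
    p = PurePosixPath(path)
    if any(re.fullmatch(pat, p.name) for pat in TEST_FILE_PATTERNS):
        return True
    # Also: any file under a directory whose name contains "test" or "spec".
    return any("test" in part.lower() or "spec" in part.lower()
               for part in p.parts[:-1])

def adds_mock(source: str) -> bool:
    """Flag source as mock-related if any identifier contains a test-double term."""
    return any(MOCK_TERMS.search(ident) for ident in IDENTIFIER.findall(source))
```

For example, `is_test_file("src/__tests__/api.test.ts")` and `adds_mock("jest.mock('./api')")` both hold. Identifier-level matching is deliberately broad: it catches hand-rolled doubles like `FakeGateway` alongside framework calls, at the cost of occasional false positives (e.g., an identifier such as `smoke_test`).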
The study is organized around three research questions (RQs). RQ1 asks how frequently coding agents generate tests. The findings show that 60% of repositories with any agent activity also contain agent-generated test activity. Moreover, 23% of agent commits add or modify test files, compared with 13% for non-agent commits. Considering all test-related commits, agents are responsible for 7% overall, rising to 17% in repositories created in 2025. RQ2 investigates how often agents introduce mocks in tests. Here, 68% of repositories with agent-generated tests also contain mock activity, and 36% of test commits authored by agents add mocks, versus 26% for non-agents. Agent commits account for 9% of all mock commits, climbing to 19% in newly created repositories. In repositories with higher agent activity, the ratio of mock commits is 36% for agents versus 28% for humans. RQ3 examines the types of test doubles used. Agents overwhelmingly rely on the "mock" type (95% of their test-double usage), while human developers employ a broader mix: mock (91%), fake (57%), and spy (51%). Dummy and stub are rarely used by either group.
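The distinction behind RQ3 is worth making concrete. The hypothetical `checkout`/gateway example below contrasts the two dominant double types: a mock verifies interactions without real behavior, while a fake is a working, simplified implementation with inspectable state:

```python
from unittest import mock

# Hypothetical system under test: a checkout that charges via a gateway.
def checkout(gateway, amount):
    return gateway.charge(amount)

# A *mock* records calls and is asserted on afterwards; it has no real logic.
gateway_mock = mock.Mock()
gateway_mock.charge.return_value = "ok"
assert checkout(gateway_mock, 100) == "ok"
gateway_mock.charge.assert_called_once_with(100)

# A *fake* is a lightweight real implementation with genuine state.
class FakeGateway:
    def __init__(self):
        self.charged = []
    def charge(self, amount):
        self.charged.append(amount)
        return "ok"

fake = FakeGateway()
assert checkout(fake, 100) == "ok"
assert fake.charged == [100]  # real state we can inspect
```

Agents' 95% preference for the first style over the second is what the authors flag: call-verification mocks are quick to generate but drift silently when the real gateway's contract changes, whereas a fake at least exercises a coherent behavior.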
From these quantitative results, the authors draw several implications. First, the high proportion of agent-authored test changes indicates that developers are already leveraging agents not only for production code but also for test maintenance and expansion, suggesting a growing role for autonomous testing assistance. Second, the agents' strong bias toward mocks may reflect an over-reliance on isolation techniques; while mocks simplify test generation, they can reduce the test's ability to validate real interactions, especially when the mock implementation diverges from the evolving production code. Third, because agents tend to use a narrow set of test-double types, there is a risk of homogeneous, less robust test suites. The authors therefore recommend that project maintainers explicitly encode mocking best-practice guidelines in agent configuration files (e.g., CLAUDE.md, copilot_instructions.md) to steer agents toward more balanced test generation. Finally, they call for future research to assess the actual fault-detection effectiveness of agent-generated tests, explore prompt engineering techniques that encourage diverse test-double usage, and develop metrics to monitor mock overuse over time.
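The configuration-file recommendation could take a form like the following hypothetical CLAUDE.md fragment; the wording is illustrative, not taken from the paper:

```markdown
## Testing guidelines

- Prefer real implementations over mocks; mock only at process or network
  boundaries (HTTP clients, clocks, filesystems).
- When isolation is needed, consider a fake (a working in-memory
  implementation) before reaching for `unittest.mock` or `jest.mock`.
- Never mock the module under test itself.
- Every mocked interaction should be covered by at least one test that
  exercises the real collaborator.
```

Because agents such as Claude Code read these files before acting, a short checklist like this is a low-cost lever for the "balanced test generation" the authors advocate.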
In summary, the paper provides compelling evidence that coding agents are indeed more likely than human developers to generate tests and to embed mocks within them, with the effect being especially pronounced in newer repositories. While this demonstrates the agents’ utility for automated test creation, it also highlights a potential quality concern: over‑mocked tests may be easier to synthesize automatically but less effective at catching real bugs. The study’s dataset, detection methodology, and findings lay a foundation for both practitioners seeking to harness agents responsibly and researchers aiming to improve the reliability of AI‑assisted software testing.