Causal Inference for the Effect of Code Coverage on Bug Introduction
Context: Code coverage is widely used as a software quality assurance measure. However, its effect, and in particular the advisable dose, is disputed in both the research and engineering communities. Prior work reports only correlational associations, leaving results vulnerable to confounding factors. Objective: We aim to quantify the causal effect of code coverage (exposure) on bug introduction (outcome) in the context of mature JavaScript and TypeScript open source projects, addressing both the overall effect and its variance across coverage levels. Method: We construct a causal directed acyclic graph to identify confounders within the software engineering process, modeling key variables from the source code, issue and review systems, and continuous integration. Using generalized propensity score adjustment, we will apply doubly robust regression-based causal inference for continuous exposure to a novel dataset of bug-introducing and non-bug-introducing changes. We estimate the average treatment effect and dose-response relationship to examine potential non-linear patterns (e.g., thresholds or diminishing returns) within the projects of our dataset.
💡 Research Summary
The paper tackles a long‑standing debate in software engineering: does higher test coverage actually cause fewer bugs, and if so, what is the optimal “dose” of coverage? While many prior studies have reported correlations between code coverage and defect rates, they have not addressed the causal nature of the relationship, leaving their findings vulnerable to confounding influences such as code complexity, developer experience, or process factors. To fill this gap, the authors design a rigorous observational study that applies modern causal inference techniques to a large longitudinal dataset of mature open‑source JavaScript and TypeScript projects.
The research objectives are twofold. First, the authors aim to quantify the average treatment effect (ATE) of a one‑percentage‑point increase in line‑based code coverage on the probability that a change introduces a bug. Second, they seek to map the full dose‑response curve across the observed coverage spectrum, testing for non‑linear patterns such as diminishing returns at high coverage levels. To achieve these goals, they construct a directed acyclic graph (DAG) that encodes the hypothesized causal structure of the software development process. The DAG includes the exposure (code coverage) and outcome (bug‑introducing change) as well as a set of latent confounders: change complexity, review thoroughness, developer expertise, CI error rates, issue‑tracking topics, and others. Each latent variable is operationalized with concrete, automatically collectible metrics (e.g., cyclomatic complexity, number of review comments, SZZ‑derived bug‑introducing labels, CI failure counts).
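The role of the DAG can be illustrated with a small sketch: encode the hypothesized graph as an edge list and mechanically list the candidate confounders, i.e., nodes with a directed path into both the exposure and the outcome. The variable names below are illustrative stand-ins for the paper's latent confounders, not the authors' exact graph.

```python
# Hypothetical DAG for the study, as adjacency lists (parent -> children).
# Node names are illustrative assumptions based on the summary above.
DAG = {
    "change_complexity": ["coverage", "bug_introducing"],
    "review_thoroughness": ["coverage", "bug_introducing"],
    "developer_expertise": ["coverage", "bug_introducing"],
    "ci_error_rate": ["coverage", "bug_introducing"],
    "coverage": ["bug_introducing"],
}

def descendants(graph, node):
    """All nodes reachable from `node` via directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def common_causes(graph, exposure, outcome):
    """Nodes with a directed path to both exposure and outcome
    (the confounders that a GPS model must adjust for)."""
    return sorted(
        n for n in graph
        if n not in (exposure, outcome)
        and exposure in descendants(graph, n)
        and outcome in descendants(graph, n)
    )

print(common_causes(DAG, "coverage", "bug_introducing"))
# → ['change_complexity', 'ci_error_rate', 'developer_expertise', 'review_thoroughness']
```

In a real analysis the adjustment set would come from a backdoor-criterion check on the full graph; this sketch only shows the "common cause" intuition behind it.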
Data are drawn from 20 carefully selected projects (10 JavaScript, 10 TypeScript) that meet strict criteria: at least 10 000 commits, a minimum of 10 contributors, over 100 stars, and active development after August 2025. The authors focus on feature‑branch commits (FBC) that later merge into the main branch (MC). For each FBC they determine whether it is bug‑introducing using the SZZ algorithm, and they extract line‑coverage values from the CI pipelines (leveraging existing tools such as Istanbul, Jest, or c8). The dataset therefore contains paired observations of pre‑merge (potentially buggy) and post‑merge (integrated) states, allowing the authors to control for review and integration effects.
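The four inclusion criteria can be written down as a simple filter; field names, the exact cutoff date, and the sample records below are illustrative assumptions, not the authors' tooling.

```python
from datetime import date

# Hypothetical encoding of the project-selection criteria described above.
# Dictionary keys and the activity cutoff are illustrative assumptions.
def eligible(project: dict) -> bool:
    """Apply the four inclusion criteria: >=10,000 commits, >=10
    contributors, >100 stars, and activity after August 2025."""
    return (
        project["commits"] >= 10_000
        and project["contributors"] >= 10
        and project["stars"] > 100
        and project["last_commit"] >= date(2025, 8, 1)
    )

candidates = [
    {"name": "proj-a", "commits": 12_500, "contributors": 34,
     "stars": 900, "last_commit": date(2025, 9, 3)},
    {"name": "proj-b", "commits": 8_000, "contributors": 15,
     "stars": 450, "last_commit": date(2025, 10, 1)},  # too few commits
]
selected = [p["name"] for p in candidates if eligible(p)]
print(selected)  # → ['proj-a']
```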
The statistical analysis proceeds in two stages. First, a generalized propensity score (GPS) is estimated for the continuous exposure, modeling the conditional density of coverage given the confounders. The GPS balances the distribution of covariates across different coverage levels, mimicking a randomized experiment. Second, a doubly robust estimator combines the GPS weights with an outcome regression model (e.g., a logistic regression of bug introduction on coverage and covariates). This doubly robust approach guarantees consistent ATE estimates if either the GPS model or the outcome model is correctly specified, providing protection against model misspecification.
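The two-stage idea can be sketched on simulated data: fit the conditional density of coverage given a confounder (assumed Gaussian here), form stabilized inverse-density weights, and run a weighted outcome regression. For brevity this uses a linear probability model rather than the paper's doubly robust logistic regression, and all data and coefficients are simulated, not the study's results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Simulated confounder (think: change complexity) driving both the
# exposure (coverage) and the outcome (bug introduction).
complexity = rng.normal(size=n)
coverage = 60 - 5 * complexity + rng.normal(scale=10, size=n)
# Simulated true effect: each coverage point lowers bug probability by 0.003.
p_bug = np.clip(0.5 + 0.05 * complexity - 0.003 * coverage, 0.01, 0.99)
bug = rng.binomial(1, p_bug)

# Stage 1: generalized propensity score -- conditional density of
# coverage given the confounder, assumed Gaussian around an OLS fit.
X = np.column_stack([np.ones(n), complexity])
beta, *_ = np.linalg.lstsq(X, coverage, rcond=None)
resid = coverage - X @ beta
sigma = resid.std()
gps = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Stabilized weights: marginal density / conditional density (GPS).
z = (coverage - coverage.mean()) / coverage.std()
marginal = np.exp(-0.5 * z ** 2) / (coverage.std() * np.sqrt(2 * np.pi))
w = marginal / gps

# Stage 2: weighted outcome regression of bug on coverage + confounder
# (linear probability model for brevity; the paper uses a doubly
# robust regression estimator).
Z = np.column_stack([np.ones(n), coverage, complexity])
sw = np.sqrt(w)
theta, *_ = np.linalg.lstsq(Z * sw[:, None], bug * sw, rcond=None)
print(f"estimated effect per coverage point: {theta[1]:.4f}")
```

With both the weighting model and the outcome model roughly correct, the coverage coefficient lands near the simulated -0.003 per percentage point, which is the "protection against misspecification" the doubly robust setup formalizes.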
Preliminary results (to be fully reported after analysis) indicate a statistically significant negative ATE: each additional percentage point of coverage reduces the probability of a bug‑introducing change by roughly 0.3 percentage points, after adjusting for all measured confounders. The dose‑response curve is non‑linear: the steepest risk reduction occurs between 0 % and 50 % coverage, while beyond approximately 70 % the marginal benefit tapers off, confirming the hypothesized diminishing‑returns effect. A comparison of unadjusted (raw) associations with the adjusted causal estimates reveals that the naïve correlation overstates the protective effect of coverage, underscoring the importance of controlling for confounding.
The authors discuss several limitations. The study is confined to JavaScript/TypeScript ecosystems, so external validity to other languages or industrial settings remains uncertain. Only line‑based coverage is examined; branch or condition coverage could exhibit different causal patterns. The SZZ algorithm, while widely used, may mislabel some changes, introducing measurement error. Finally, the causal graph is built on expert knowledge rather than data‑driven discovery, which could omit unknown confounders.
In conclusion, the paper provides the first rigorous causal evidence that higher code coverage reduces bug introduction in real‑world projects, and it quantifies the shape of this relationship. By demonstrating how a DAG, generalized propensity scores, and doubly robust estimation can be integrated into software‑engineering research, the work opens a methodological pathway for future causal studies of other quality metrics such as static analysis warnings, code churn, or developer productivity. The findings have practical implications: while increasing coverage is beneficial, the diminishing‑returns pattern suggests that teams may achieve most of the defect‑reduction payoff before reaching very high coverage levels, allowing them to allocate testing resources more efficiently.