An Empirical Study of the Imbalance Issue in Software Vulnerability Detection

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Vulnerability detection is crucial for protecting software security. Nowadays, deep learning (DL) is the most promising technique for automating this detection task, leveraging its superior ability to extract patterns and representations from extensive volumes of code. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance varying widely across datasets. Drawing insights from other well-explored application areas such as computer vision, we conjecture that the imbalance issue (vulnerable code samples are extremely rare) is at the core of this phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns out that these solutions also perform differently across datasets and evaluation metrics. Specifically: 1) focal loss is better suited to improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling facilitates the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new ones.


💡 Research Summary

The paper conducts a thorough empirical investigation into the class‑imbalance problem that plagues deep‑learning (DL) based software vulnerability detection. Recognizing that vulnerable functions constitute only a tiny fraction of all source‑code functions (often less than 1 in several hundred), the authors hypothesize that this imbalance is a primary cause of the widely reported variability in model performance across datasets. To test the hypothesis, they assemble nine publicly available vulnerability datasets (e.g., Lin2018, Devign, etc.) and fine‑tune two state‑of‑the‑art foundation models, CodeBERT and GraphCodeBERT, under a uniform experimental pipeline.

Four research questions guide the study.

RQ1 examines how imbalance affects training dynamics. The authors observe that the loss on vulnerable samples remains higher throughout training: the model quickly minimizes error on the majority (secure) class while neglecting the minority, resulting in a high false-negative rate.

RQ2 evaluates which metrics are appropriate. The authors demonstrate that accuracy can be misleading in highly skewed settings and argue that precision, recall, and F1-score provide a more realistic picture of detection capability.

RQ3 compares seven imbalance-mitigation techniques drawn from computer vision and NLP: three data-level methods (random down-sampling, random over-sampling, and adversarial-attack-based augmentation) and four model-level methods (mean false error loss, class-balanced loss, focal loss, and threshold-moving). The experiments reveal distinct strengths: focal loss markedly improves precision by focusing on hard, misclassified examples; mean false error and class-balanced losses boost recall by weighting minority-class errors more heavily; random over-sampling yields the highest F1-score by restoring class balance; and threshold-moving can fine-tune a specific metric but is highly sensitive to distribution shifts.

RQ4 explores external factors that modulate the effectiveness of these techniques, such as the presence or absence of certain vulnerability types, the intrinsic difficulty of detecting particular bugs (e.g., memory overflows), and disparities between training and test data distributions. The authors find that when a vulnerability type is scarce or especially hard to detect, all methods suffer, and over-sampling may cause over-fitting if synthetic samples do not faithfully preserve syntactic and semantic correctness.
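To make the two loss-level strategies concrete, here is a minimal, framework-free sketch (not the paper's implementation): a binary focal loss that down-weights easy examples via the `(1 - p_t)^gamma` factor, and class-balanced weights derived from the "effective number" of samples per class. The hyperparameter values are common defaults from the original focal-loss and class-balanced-loss papers, not values reported in this study.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy, well-classified examples so
    training focuses on hard (often minority-class) samples.
    p: predicted probability of the positive (vulnerable) class; y: 0 or 1."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to the inverse 'effective number'
    of samples, (1 - beta^n) / (1 - beta); rare classes get larger weights.
    Normalized so the weights sum to the number of classes."""
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    raw = [1.0 / e for e in effective]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]
```

With a 990:10 secure-to-vulnerable split, `class_balanced_weights([990, 10])` assigns the vulnerable class a weight roughly 60 times larger than the secure class, which is how these losses push models toward higher recall on the minority class.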

Overall, the study confirms that class imbalance is a central obstacle for DL‑based vulnerability detection and that no single mitigation strategy dominates across all evaluation criteria and datasets. The authors advocate for future work that (1) designs loss functions that incorporate vulnerability‑type difficulty, (2) develops code‑aware augmentation techniques that guarantee syntactic and semantic validity, and (3) adopts multi‑objective optimization to jointly maximize precision, recall, and F1‑score. All datasets, code, and experimental artifacts are released publicly to ensure reproducibility and to encourage the community to build upon these findings.
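Since the study's argument hinges on accuracy being uninformative under heavy skew, a short sketch of the per-class metrics it favors may help; the following is an illustrative implementation assuming binary labels with 1 = vulnerable, not code from the paper's released artifacts.

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the vulnerable (positive = 1) class,
    computed from raw binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

On a 99:1 split, a degenerate model that labels everything "secure" still achieves 99% accuracy, yet its recall and F1 here are both zero, which is precisely why the authors evaluate mitigation strategies on these metrics instead.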

