Statistical-Based Metric Threshold Setting Method for Software Fault Prediction in Firmware Projects: An Industrial Experience
Ensuring software quality in embedded firmware is critical, especially in safety-critical domains where compliance with functional safety standards (ISO 26262) requires strong guarantees of software reliability. While machine learning-based fault prediction models have demonstrated high accuracy, their lack of interpretability limits their adoption in industrial settings. Developers need actionable insights that can be directly employed in software quality assurance processes and guide defect mitigation strategies. In this paper, we present a structured process for defining context-specific software metric thresholds suitable for integration into fault detection workflows in industrial settings. Our approach supports cross-project fault prediction by deriving thresholds from one set of projects and applying them to independently developed firmware, thereby enabling reuse across similar software systems without retraining or domain-specific tuning. We analyze three real-world embedded C firmware projects provided by an industrial partner, using the Coverity and Understand static analysis tools to extract software metrics. Through statistical analysis and hypothesis testing, we identify discriminative metrics and derive empirical threshold values capable of distinguishing faulty from non-faulty functions. The derived thresholds are validated through an experimental evaluation, demonstrating their effectiveness in identifying fault-prone functions with high precision. The results confirm that the derived thresholds can serve as an interpretable solution for fault prediction, aligning with industry standards and SQA practices. This approach provides a practical alternative to black-box AI models, allowing developers to systematically assess software quality, take preventive actions, and integrate metric-based fault prediction into industrial development workflows to mitigate software faults.
💡 Research Summary
The paper addresses the need for interpretable and reusable fault‑prediction techniques in safety‑critical embedded firmware, particularly in automotive applications that must comply with functional safety standards such as ISO 26262. While machine‑learning models can achieve high predictive accuracy, their black‑box nature hampers adoption in industrial environments where developers require clear, actionable guidance. To overcome this, the authors propose a statistical‑based method that derives project‑specific metric thresholds and validates their applicability across independent firmware projects without retraining.
Data were collected from three real‑world C‑based firmware projects supplied by an automotive OEM. Static analysis tools (Coverity and Understand) extracted a comprehensive set of source‑code metrics—including lines of code (LOC), cyclomatic complexity, coupling, cohesion, and others—at the function level. Fault information was harvested from the partner’s issue‑tracking system and linked to the corresponding functions, creating a labeled dataset of faulty versus non‑faulty functions.
The methodology proceeds in four stages. First, each metric’s distribution is examined for normality; based on the result, either Student’s t‑test (for normally distributed metrics) or the Mann‑Whitney U‑test (for non‑normal metrics) is applied to assess whether the metric values differ significantly between faulty and clean functions. Metrics with p‑values ≤ 0.05 are retained as discriminative candidates. Second, for each retained metric, Receiver Operating Characteristic (ROC) curves are plotted and the Area Under the Curve (AUC) is computed. Only metrics with AUC ≥ 0.7 are considered sufficiently predictive. Third, the optimal threshold for each metric is identified using the Youden index (J = sensitivity + specificity − 1), which selects the cut‑off maximizing the sum of sensitivity and specificity. The resulting thresholds (e.g., cyclomatic complexity > 15, coupling > 8, LOC > 200, cohesion < 0.3) constitute concrete, interpretable limits that separate “acceptable” from “risk‑prone” code. Finally, cross‑project validation is performed: thresholds derived from Project A are applied unchanged to Projects B and C, and the resulting precision, recall, and F1‑score are measured.
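The four stages above can be sketched in a few lines of Python. The data here is synthetic and the metric name is illustrative; the paper uses real function-level metrics from Coverity/Understand, but the statistical machinery (normality check, hypothesis test, ROC/AUC gate, Youden index) is the same:

```python
# Sketch of the threshold-derivation pipeline on synthetic data.
# The metric values below are invented, not from the paper's dataset.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
# Hypothetical cyclomatic-complexity samples: faulty functions skew higher.
clean = rng.lognormal(mean=1.8, sigma=0.5, size=300)
faulty = rng.lognormal(mean=2.6, sigma=0.5, size=60)

# Stage 1: normality check decides which hypothesis test applies.
is_normal = (stats.shapiro(clean).pvalue > 0.05 and
             stats.shapiro(faulty).pvalue > 0.05)
if is_normal:
    p = stats.ttest_ind(faulty, clean).pvalue      # Student's t-test
else:
    p = stats.mannwhitneyu(faulty, clean).pvalue   # Mann-Whitney U-test

if p <= 0.05:                                      # discriminative candidate
    # Stage 2: ROC curve using the raw metric as the ranking score.
    y = np.r_[np.ones(faulty.size), np.zeros(clean.size)]
    scores = np.r_[faulty, clean]
    fpr, tpr, thresholds = roc_curve(y, scores)
    if auc(fpr, tpr) >= 0.7:                       # sufficiently predictive
        # Stage 3: Youden index J = TPR - FPR, maximized over cut-offs.
        j = tpr - fpr
        best = thresholds[np.argmax(j)]
        print(f"flag functions with metric > {best:.1f}")
```

Stage 4 (cross-project validation) then amounts to applying `best` unchanged to the other projects' functions and measuring precision and recall.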
Experimental results show that the selected metrics indeed differentiate faulty from non‑faulty functions. Applying the thresholds to the original project yields a precision of 0.84 and a recall of 0.71, meaning that roughly 84 % of the functions flagged by the thresholds truly contain defects while about 71 % of all defective functions are captured. When transferred to the other two projects, the average precision remains above 0.80 and recall stays above 0.65, demonstrating that the thresholds are robust to project‑specific variations. Moreover, the filtering step reduces the inspection workload to roughly 12 % of the total function base, allowing quality‑assurance teams to focus testing and code‑review resources on a manageable subset.
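The reading of precision and recall above follows directly from the confusion counts. A toy example with invented counts (chosen only to reproduce the reported ratios) makes the arithmetic explicit:

```python
# Illustrative confusion counts; numbers are invented to match the
# reported precision/recall, not taken from the paper's raw data.
true_positives = 84    # flagged functions that really contain defects
false_positives = 16   # flagged functions that are actually clean
false_negatives = 34   # defective functions the thresholds missed

precision = true_positives / (true_positives + false_positives)  # 0.84
recall = true_positives / (true_positives + false_negatives)     # ~0.71
```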
The authors discuss several practical implications. Because thresholds are derived from statistically validated data, they can be embedded directly into continuous‑integration pipelines: any function exceeding a threshold automatically triggers a warning, a mandatory review, or additional testing. This aligns with industry quality‑assurance processes and satisfies safety‑standard requirements for traceable, evidence‑based risk assessment. The approach also sidesteps the computational overhead and data‑intensity of machine‑learning models, making it attractive for organizations with limited defect‑history data.
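A CI gate built on such thresholds can be very small. The sketch below uses the example threshold values from the methodology section; the metrics-report format and the `risky` helper are assumptions for illustration — a real pipeline would parse the Coverity/Understand output:

```python
# Minimal sketch of a CI gate applying fixed metric thresholds.
# Threshold values come from the paper's examples; the input format
# and helper name are hypothetical.
THRESHOLDS = {  # metric -> (limit, which side of the limit is risky)
    "cyclomatic_complexity": (15, "above"),
    "coupling": (8, "above"),
    "loc": (200, "above"),
    "cohesion": (0.3, "below"),
}

def risky(metrics: dict) -> list[str]:
    """Return the names of the thresholds this function violates."""
    hits = []
    for name, (limit, side) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (side == "above" and value > limit) or \
           (side == "below" and value < limit):
            hits.append(name)
    return hits

# One function's metrics, as they might come from a static-analysis report.
fn = {"cyclomatic_complexity": 22, "coupling": 3, "loc": 180, "cohesion": 0.2}
violations = risky(fn)
if violations:
    print("WARN: review required, thresholds exceeded:", violations)
```

Wiring this check into the build means every commit is screened against the same evidence-based limits, giving the traceable risk assessment that safety standards ask for.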
Limitations acknowledged include the relatively small number of projects (three) and the inherent class imbalance (faulty functions are a minority). The paper suggests future work to expand the dataset across more domains, explore multi‑metric composite thresholds (e.g., using logistic regression or decision trees on the selected metrics), and investigate the integration of the method with lightweight explainable‑AI techniques for even richer guidance.
In conclusion, the study demonstrates that a rigorously statistical process for extracting discriminative software metrics and defining empirical thresholds can provide an interpretable, high‑precision fault‑prediction mechanism that is reusable across similar embedded firmware projects. This offers a pragmatic alternative to opaque AI models, enabling automotive and other safety‑critical developers to embed fault‑prediction directly into their development and quality‑assurance workflows, thereby improving software reliability while maintaining compliance with stringent functional‑safety standards.