Crash Severity Risk Modeling Strategies under Data Imbalance

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

This study investigates crash severity risk modeling strategies for work zones involving large vehicles (i.e., trucks, buses, and vans) under crash data imbalance between low-severity (LS) and high-severity (HS) crashes. We utilized crash data involving large vehicles in South Carolina work zones from 2014 to 2018, which included four times more LS crashes than HS crashes. The objective of this study is to evaluate the crash severity prediction performance of various statistical, machine learning, and deep learning models under different feature selection and data balancing techniques. Findings highlight a disparity in LS and HS predictions, with lower accuracy for HS crashes due to class imbalance and feature overlap. Discriminative Mutual Information (DMI) yields the most effective feature set for predicting HS crashes without requiring data balancing, particularly when paired with gradient boosting models and deep neural networks such as CatBoost, NeuralNetTorch, XGBoost, and LightGBM. Data balancing techniques such as NearMiss-1 maximize HS recall when combined with DMI-selected features and certain models such as LightGBM, making them well-suited for HS crash prediction. Conversely, RandomUnderSampler, HS Class Weighting, and RandomOverSampler achieve more balanced performance, which is defined as an equitable trade-off between LS and HS metrics, especially when applied to NeuralNetTorch, NeuralNetFastAI, CatBoost, LightGBM, and Bayesian Mixed Logit (BML) using merged feature sets or models without feature selection. The insights from this study offer safety analysts guidance on selecting models, feature selection, and data balancing techniques aligned with specific safety goals, providing a robust foundation for enhancing work-zone crash severity prediction.


💡 Research Summary

This paper addresses the challenging problem of predicting crash severity for large commercial vehicles (trucks, buses, and vans) operating within work zones, with a particular focus on the pronounced class imbalance between low‑severity (LS) and high‑severity (HS) crashes. Using a comprehensive dataset from South Carolina work zones spanning 2014‑2018, the authors assembled 5,351 crash records, of which 4,217 (≈79 %) were classified as LS (property‑damage‑only) and 1,134 (≈21 %) as HS (injury/fatality). The resulting LS‑to‑HS ratio of roughly 4:1 is a class imbalance typical of crash data; it can bias conventional classifiers toward the majority class and reduce their ability to correctly identify the more critical HS events.
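One common response to such an imbalance, used later in the paper as "HS Class Weighting", is to weight each class inversely to its frequency. A minimal sketch using the paper's reported class counts (the "balanced" weighting heuristic shown here is a standard convention, not necessarily the paper's exact formula):

```python
# Inverse-frequency ("balanced") class weights for the LS/HS label.
# Counts mirror the paper's dataset: 4,217 LS vs. 1,134 HS records.
from collections import Counter

labels = ["LS"] * 4217 + ["HS"] * 1134
counts = Counter(labels)
n, k = len(labels), len(counts)

# weight_c = n / (k * count_c): rarer classes get larger weights
weights = {c: n / (k * counts[c]) for c in counts}
print(round(weights["HS"] / weights["LS"], 2))  # -> 3.72, i.e. HS errors cost ~3.7x more
```

The HS-to-LS weight ratio reduces to the raw count ratio (4,217 / 1,134 ≈ 3.72), which is why a roughly 4:1 imbalance translates directly into a roughly 4x penalty on misclassified HS crashes.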

Data preprocessing involved removing variables with >50 % missing values, imputing the remaining missing entries (≤10 %) with mode substitution, and discarding irrelevant spatial identifiers (latitude, longitude, route number, road name) after confirming that all records indeed fell within designated work zones. Continuous variables such as estimated collision speed and base offset distance were discretized into categorical bins based on prior literature and expert judgment. Temporal variables were also simplified (e.g., “time of collision” into five intervals, “day of collision” into weekday vs. weekend).
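The imputation and discretization steps described above can be sketched in a few lines. The cut points and bin labels below are illustrative placeholders, not the paper's actual thresholds:

```python
# Hedged sketch of two preprocessing steps: mode imputation for sparse
# missing values, and binning a continuous variable (e.g., estimated
# collision speed in mph) into categorical ranges.
import bisect
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def bin_speed(mph, edges=(25, 45, 65),
              labels=("low", "medium", "high", "very_high")):
    """Map a speed to a categorical bin via right-open intervals."""
    return labels[bisect.bisect_right(edges, mph)]

print(impute_mode(["dry", None, "dry", "wet"]))  # ['dry', 'dry', 'dry', 'wet']
print(bin_speed(30))                             # medium
```

Discretizing in this way trades some resolution for categories that statistical models such as the Bayesian Mixed Logit can consume directly.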

Feature‑selection techniques evaluated were: Pearson correlation, Random‑Forest feature importance, Recursive Feature Elimination with logistic regression, Chi‑square tests, and Discriminative Mutual Information (DMI). DMI emerged as the most powerful method for isolating features that differentiate HS from LS crashes, delivering the highest mutual information scores and consistently improving HS recall across models.
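The quantity underlying DMI is mutual information between a feature and the severity label. A minimal sketch of plain mutual information for categorical data follows; the paper's DMI criterion is a discriminative variant of this, so the code shows only the base quantity being maximized:

```python
# Mutual information (in nats) between a categorical feature and the
# binary LS/HS label: MI = sum_{x,y} p(x,y) * log(p(x,y) / (p(x)p(y))).
from collections import Counter
from math import log

def mutual_information(feature, label):
    n = len(feature)
    pxy = Counter(zip(feature, label))
    px, py = Counter(feature), Counter(label)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A feature perfectly aligned with the label attains the label entropy.
f = ["a", "a", "b", "b"]
y = ["LS", "LS", "HS", "HS"]
print(round(mutual_information(f, y), 3))  # 0.693, i.e. ln(2)
```

Features with near-zero mutual information contribute little to separating HS from LS crashes and are the natural candidates for removal.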

Modeling approaches covered three families: (1) a statistical Bayesian Mixed Logit (BML) model, (2) a suite of machine‑learning algorithms (CatBoost, LightGBM, XGBoost, Extra Trees, Random Forest), and (3) deep‑learning architectures (NeuralNetTorch, NeuralNetFastAI). In total, twelve distinct model configurations were trained, each under multiple feature‑selection scenarios (e.g., DMI‑selected features vs. the full feature set).

A broad set of class‑balancing strategies was explored:

  • Undersampling: NearMiss‑1, RandomUnderSampler
  • Oversampling: RandomOverSampler, SMOTE, ADASYN, K‑SMOTE, Wasserstein GAN with Gradient Penalty (WGAN‑GP) and its conditional variant
  • Hybrid combinations of oversampling and undersampling
  • Cost‑sensitive weighting of the HS class (HS Class Weighting)

Performance metrics comprised overall accuracy, precision, recall, F1‑score, and AUC‑ROC, with a particular emphasis on HS‑specific recall and AUC because the safety community prioritizes correctly flagging severe crashes.
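The class-specific metrics the comparison hinges on are straightforward to compute from a confusion matrix. A self-contained sketch of HS precision, recall, and F1 (the data here is a made-up five-record example):

```python
# HS-focused precision, recall, and F1 from true/predicted labels.
def hs_metrics(y_true, y_pred, positive="HS"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # fraction of HS crashes caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["HS", "HS", "HS", "LS", "LS"]
y_pred = ["HS", "HS", "LS", "LS", "HS"]
prec, rec, f1 = hs_metrics(y_true, y_pred)
print(round(rec, 2), round(f1, 2))  # 0.67 0.67
```

Overall accuracy can stay high even when HS recall collapses, which is exactly why the study reports HS recall and AUC separately rather than relying on accuracy alone.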

Key findings:

  1. DMI + Gradient‑Boosting models (CatBoost, LightGBM, XGBoost) achieved HS recall between 0.71 and 0.74 and AUC values above 0.84 even without any resampling, indicating that a well‑chosen feature set can mitigate imbalance effects.

  2. NearMiss‑1 undersampling dramatically boosted HS recall to 0.78 when paired with LightGBM, delivering the highest HS F1‑score (≈0.73). This approach is ideal when the primary goal is to maximize detection of severe events, accepting a modest reduction in LS performance.

  3. RandomUnderSampler and HS class weighting produced more balanced LS/HS metrics. When combined with NeuralNetTorch, NeuralNetFastAI, CatBoost, or LightGBM, they yielded overall accuracies around 0.84, HS recall near 0.66, and relatively stable LS precision, offering a pragmatic trade‑off for agencies that need both overall reliability and reasonable HS detection.

  4. Bayesian Mixed Logit (BML), while offering superior interpretability, lagged behind boosting and deep‑learning models in HS recall (≈0.62) and overall AUC, underscoring the limitations of linear‑type statistical models for highly non‑linear crash data.

  5. Deep‑learning models (especially NeuralNetTorch) performed competitively when supplied with either the merged feature set or DMI‑selected features, and they benefited most from cost‑sensitive weighting rather than aggressive undersampling.

The authors synthesize these results into actionable guidance: for agencies whose priority is maximizing HS detection, the optimal pipeline is NearMiss‑1 → DMI feature selection → LightGBM. For those seeking a balanced performance across severity levels, the recommended configuration is RandomUnderSampler or HS class weighting → DMI or full feature set → NeuralNetTorch/CatBoost/LightGBM.

The paper concludes by emphasizing that feature selection (particularly DMI) can sometimes replace the need for complex resampling, reducing computational overhead and preserving the natural data distribution. It also calls for future work to integrate real‑time traffic and weather feeds, test the models on other states’ datasets for external validity, and develop an operational decision‑support system that can alert work‑zone managers in near‑real time.

Overall, this study provides a thorough, methodologically rigorous comparison of statistical, machine‑learning, and deep‑learning approaches under severe class imbalance, delivering clear, evidence‑based recommendations for transportation safety practitioners aiming to improve crash‑severity prediction for large vehicles in work zones.

