Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring
The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-structured models due to their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application within the financial realm predominantly revolves around question-answering tasks and the use of such models for tabular-structured credit scoring datasets remains largely unexplored. Tabular-oriented large models, such as TabPFN, has made the application of large models in credit scoring feasible, albeit can only processing with limited sample sizes. This paper provides a novel framework to combine tabular-tailored dataset distillation technique with the pretrained model, empowers the scalability for TabPFN. Furthermore, though class imbalance distribution is the common nature in financial datasets, its influence during dataset distillation has not been explored. We thus integrate the imbalance-aware techniques during dataset distillation, resulting in improved performance in financial datasets (e.g., a 2.5% enhancement in AUC). This study presents a novel framework for scaling up the application of large pretrained models on financial tabular datasets and offers a comparative analysis of the influence of class imbalance on the dataset distillation process. We believe this approach can broaden the applications and downstream tasks of large models in the financial domain.
💡 Research Summary
The paper addresses three intertwined challenges that hinder the deployment of large pretrained tabular models such as TabPFN in real‑world credit‑scoring applications: (1) severe class imbalance (default rates often below 5 %), (2) strict privacy and data‑sharing regulations, and (3) the hard input‑size limit of TabPFN, which restricts the number of training samples that can be processed at once. To simultaneously tackle these issues, the authors propose a class‑imbalance‑aware dataset distillation framework that compresses large credit‑scoring datasets into a tiny synthetic support set while preserving the geometry of the minority class.
The core technical contribution lies in augmenting the Kernel Inducing Points (KIP) distillation algorithm with three imbalance‑sensitive loss components: (i) focal loss to up‑weight hard minority examples, (ii) LDAM (label‑distribution‑aware margin) to enlarge decision boundaries for the minority class, and (iii) a class‑weighted mean‑squared‑error term that prevents the overall loss from being dominated by the majority class. By jointly optimizing these objectives, the distilled synthetic set learns to encode minority‑class structure rather than merely mimicking the majority distribution.
Experiments were conducted on six publicly available credit‑scoring datasets (e.g., German Credit, LendingClub, GiveMeSomeCredit). Each dataset was split 80 %/20 % into training and held‑out test sets. The distillation stage produced synthetic support sets ranging from 5 % to 10 % of the original training size. Baselines included (a) random subsets of equal size, and (b) standard KIP distillation using only MSE loss. Downstream models—LightGBM, XGBoost, Logistic Regression, MLP, K‑Nearest Neighbours, and TabPFN—were trained on the original data, the distilled sets, and the random subsets. Performance was measured with AUC, KS, F1‑Score, recall, and precision.
Key findings:
- The imbalance‑aware distillation consistently outperformed the MSE‑only baseline, delivering an average AUC gain of 4.2 percentage points and up to 8.7 pp in the most skewed datasets.
- Distilled sets retained 76 %–95 % of the full‑data AUC while using less than 10 % of the original samples (maximum 31.3 % of training records).
- Across all classifiers, the distilled support sets beat equally sized random subsets, with the most pronounced improvements observed for tree‑based models and TabPFN, which benefits directly from the reduced input size.
- Privacy was evaluated via the Nearest‑Neighbour Distance Ratio (NNDR). Median NNDR values between 0.8 and 0.9 indicated that synthetic points are geometrically distant from any individual real record, suggesting low memorization risk. Formal differential‑privacy guarantees, however, were not provided.
- A meta‑regression analysis linked the magnitude of AUC improvement to the degree of class imbalance and to the weighting of the imbalance‑aware loss terms, confirming that the approach is especially beneficial when the minority class proportion falls below 5 %.
The authors acknowledge that the current privacy assessment is limited to geometric distance metrics and that integrating rigorous differential privacy mechanisms remains future work. They also propose extending the framework to federated learning scenarios where multiple institutions could exchange distilled, privacy‑preserving support sets without exposing raw customer data.
In summary, the study introduces a novel, privacy‑conscious pipeline that (1) compresses large, highly imbalanced credit‑scoring datasets via class‑aware distillation, (2) enables scalable inference with pretrained tabular models such as TabPFN beyond their native sample‑size limits, and (3) demonstrates consistent predictive gains across a variety of downstream classifiers. This work bridges the gap between data‑level imbalance handling, privacy preservation, and model scalability, offering a practical pathway for deploying large pretrained models in the financial sector.
Comments & Academic Discussion
Loading comments...
Leave a Comment