Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset
Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated, we create a synthetic dataset of 100,000 Istanbul residents that reproduces first-quarter 2025 TÜİK (TURKSTAT) census marginals and telecom usage patterns. Retrieval-augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio-demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data, CatBoost, LightGBM, and XGBoost are each trained in two versions: Demo models use only the socio-demographic variables, while Full models include both socio-demographic and alternative attributes. Across five-fold stratified validation, the alternative block raises area under the curve (AUC) by about 1.3 percentage points and lifts balanced F1 from roughly 0.84 to 0.95, a fourteen percent relative gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau-level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.
💡 Research Summary
The paper tackles the pressing problem of credit exclusion among underbanked consumers in Istanbul, who typically lack any bureau‑based credit history because their income and payments flow through informal channels. To enable systematic research on alternative‑data‑driven credit scoring, the authors generate a fully synthetic dataset of 100,000 residents that faithfully reproduces the marginal distributions of the Turkish Statistical Institute (TÜİK) Q1‑2025 census and publicly available telecom usage statistics. The data generation pipeline follows six stages: (1) joint sampling of occupation and income from micro‑tables, refined by an OpenAI o3 model to align sector keywords with education levels; (2) assignment of education, allowing modest upward adjustments; (3) mapping of salary to smartphone tier pools derived from e‑commerce sales, with o3 selecting brand and age; (4) rule‑based determination of car ownership, brand tier, district, and rent based on income‑ranked neighborhood maps; (5) synthesis of behavioral features (monthly subscriptions, online shopping frequency, ride‑hailing intensity, social‑media activity) using industry‑derived base rates plus stochastic perturbations; (6) labeling of delinquency (delinquency_FL) through a hybrid rule set that blends employment volatility, device replacement frequency, rent‑to‑income ratio, and shopping volatility. A sanity filter removes economically impossible combinations (e.g., minimum‑wage workers owning luxury cars). No personally identifiable information or traditional bureau variables appear in the final table.
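The sanity filter mentioned above can be pictured as a small set of plausibility rules applied to each generated profile. The following is a minimal sketch, not the authors' implementation: the `Profile` class, the `car_brand_tier` values, the `ASSUMED_MIN_WAGE` placeholder, and the rent-share rule are all assumptions introduced for illustration. Only the "minimum-wage worker with a luxury car" rule comes from the summary.

```python
from dataclasses import dataclass

# Placeholder value, NOT taken from the paper or from official statistics.
ASSUMED_MIN_WAGE = 20_000  # hypothetical monthly minimum wage in TRY


@dataclass
class Profile:
    monthly_income: float
    monthly_rent: float
    owns_car: bool
    car_brand_tier: str  # hypothetical tiers: "none", "economy", "mid", "luxury"


def is_plausible(p, min_wage=ASSUMED_MIN_WAGE, max_rent_share=0.9):
    """Return False for combinations treated as economically impossible."""
    # Internal consistency: a car tier only makes sense for car owners.
    if not p.owns_car and p.car_brand_tier != "none":
        return False
    # Example from the summary: minimum-wage earners owning luxury cars.
    if p.owns_car and p.car_brand_tier == "luxury" and p.monthly_income <= min_wage:
        return False
    # Assumed extra rule: rent may not consume almost all of the income.
    if p.monthly_income <= 0 or p.monthly_rent > max_rent_share * p.monthly_income:
        return False
    return True


def sanity_filter(profiles):
    """Keep only the profiles that pass every plausibility rule."""
    return [p for p in profiles if is_plausible(p)]
```

In a pipeline of this kind the filter would typically run after stage (6), so that rejected rows never reach the released table.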
The final schema contains seven socio‑demographic variables (age, education, employment_status, job, monthly_income, home_district, owns_home) and ten alternative attributes (phone_model, phone_purchase_date, owns_car, car_brand, car_purchase_date, owns_credit_card, monthly_subscriptions, online_shopping_frequency, social_media_active, monthly_rent). The target variable is a binary indicator of whether the borrower becomes 30‑plus days past‑due within 12 months.
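To make the two feature blocks concrete, the sketch below writes the schema listed above as plain Python lists and selects the columns for each model variant. The column names and the `delinquency_FL` label name are taken from the summary; the helper names `feature_columns` and `select_features` are hypothetical.

```python
from typing import Dict, List

# Column names as listed in the summary.
SOCIO_DEMOGRAPHIC: List[str] = [
    "age", "education", "employment_status", "job",
    "monthly_income", "home_district", "owns_home",
]
ALTERNATIVE: List[str] = [
    "phone_model", "phone_purchase_date", "owns_car", "car_brand",
    "car_purchase_date", "owns_credit_card", "monthly_subscriptions",
    "online_shopping_frequency", "social_media_active", "monthly_rent",
]
TARGET = "delinquency_FL"  # binary 30-plus days past-due flag


def feature_columns(variant: str) -> List[str]:
    """Return the input columns for the 'demo' or 'full' model variant."""
    if variant == "demo":
        return list(SOCIO_DEMOGRAPHIC)
    if variant == "full":
        return SOCIO_DEMOGRAPHIC + ALTERNATIVE
    raise ValueError(f"unknown variant: {variant!r}")


def select_features(record: Dict[str, object], variant: str) -> Dict[str, object]:
    """Project one record (a dict keyed by column name) onto a variant."""
    return {name: record[name] for name in feature_columns(variant)}
```

Keeping the target out of both lists guards against accidentally leaking the label into the model inputs.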
For modeling, the authors compare “Demo” versions that use only the socio-demographic block with “Full” versions that augment the same base with the alternative attributes. Three state-of-the-art gradient-boosting libraries (CatBoost, LightGBM, and XGBoost) are trained on each variant. Hyperparameters are tuned via Bayesian optimization (Tree-structured Parzen Estimator) over 50 trials, with balanced class weights, early stopping after 100 rounds without improvement, and both L1 and L2 regularization. Nested cross-validation is employed: an outer five-fold stratified split preserves the delinquency prevalence, while an inner five-fold split selects the hyperparameters that maximize mean validation AUC. Logistic regression with elastic-net, Random Forest (500 trees, depth ≤12), and a single Decision Tree (depth ≤8) serve as baseline comparators.
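The nested cross-validation described above can be sketched with the standard library alone. This is an illustrative stand-in, not the authors' code: `stratified_folds` and `nested_cv` are hypothetical helpers, `fit_and_score` is an assumed callback that trains a model on one index set and returns its AUC on another, and the short `candidates` list stands in for the 50 trials that a Tree-structured Parzen Estimator would propose.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, k, seed=0):
    """Split `indices` into k folds with roughly equal class prevalence."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def nested_cv(labels, candidates, fit_and_score, k_outer=5, k_inner=5, seed=0):
    """Return one (best_params, outer_test_score) pair per outer fold."""
    outer = stratified_folds(list(range(len(labels))), labels, k_outer, seed)
    results = []
    for o, test_idx in enumerate(outer):
        train_idx = [i for f, fold in enumerate(outer) if f != o for i in fold]
        inner = stratified_folds(train_idx, labels, k_inner, seed + 1 + o)

        def mean_inner_score(params):
            scores = []
            for v, val_idx in enumerate(inner):
                fit_idx = [i for f, fold in enumerate(inner) if f != v for i in fold]
                scores.append(fit_and_score(params, fit_idx, val_idx))
            return sum(scores) / len(scores)

        # Hyperparameters are chosen on inner folds only; the outer test
        # fold is touched once, for the final score of this outer split.
        best = max(candidates, key=mean_inner_score)
        results.append((best, fit_and_score(best, train_idx, test_idx)))
    return results
```

Because selection happens strictly inside each outer training set, the outer scores are not inflated by the tuning step, which is the point of nesting.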
Performance is evaluated using AUC, precision, recall, and the balanced F1 score. Across all three boosting algorithms, adding the alternative data block yields an average AUC lift of approximately 0.013 (1.3 percentage points) and a substantial increase in balanced F1 from roughly 0.84 to 0.95, a 14% relative gain. Feature-importance analysis via SHAP reveals that the most predictive alternative attributes are phone replacement cadence, total monthly subscription spend, online shopping frequency, and car ownership, confirming that these behavioral signals capture financial discipline and asset stability beyond what demographics alone can provide. Fairness diagnostics (group-wise AUC and F1 across age, gender, and income brackets) show negligible disparities, indicating that the models do not introduce overt bias.
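As an illustration of the group-wise AUC diagnostic, the sketch below computes AUC from its pairwise definition and then repeats the computation per group. The function names `roc_auc`, `groupwise_auc`, and `auc_gap` are hypothetical, and the pairwise loop is quadratic, so it suits small examples rather than a 100,000-row table, where a rank-based or library implementation would be the usual choice.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def groupwise_auc(labels, scores, groups):
    """Compute AUC separately for each value in `groups`."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in buckets.items()}


def auc_gap(per_group):
    """Largest difference in AUC between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)
```

A small `auc_gap` across, say, income brackets is the kind of evidence the summary describes as negligible disparity; what counts as small is a policy choice that this sketch does not settle.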
The authors make the synthetic dataset, the full preprocessing and modeling pipeline, and the hyperparameter search scripts publicly available on GitHub (https://github.com/atalaydenknalbant/underbanked_risk_estimation). By releasing a realistic, bureau‑free benchmark, they address a notable gap in the literature where most public datasets still contain traditional credit variables. The paper also outlines a transparent workflow for explainability (SHAP visualizations) and regulatory compliance (group fairness checks), offering lenders and supervisors a reproducible blueprint for extending credit to underbanked populations in a responsible manner.
In conclusion, the study demonstrates that a concise set of non‑financial, digitally derived attributes can approach the discriminative power of traditional bureau scores for a population lacking formal credit histories. The synthetic Istanbul 2025 Q1 dataset and the accompanying reproducible pipeline provide a valuable resource for future research on inclusive, data‑driven credit risk assessment.