High-dimensional censored MIDAS logistic regression for corporate survival forecasting
This paper addresses the challenge of forecasting corporate distress, a problem marked by three key statistical hurdles: (i) right censoring, (ii) high-dimensional predictors, and (iii) mixed-frequency data. To overcome these complexities, we introduce a novel high-dimensional censored MIDAS (Mixed Data Sampling) logistic regression. Our approach handles censoring through inverse probability weighting and achieves accurate estimation with numerous mixed-frequency predictors by employing a sparse-group penalty. We establish finite-sample bounds for the estimation error, accounting for censoring, MIDAS approximation error, and heavy tails. For statistical inference, we develop a de-sparsified version of the proposed penalized estimator and establish its asymptotic theory, which enables valid statistical inference in high-dimensional settings with censoring. We show that censoring induces a nonstandard variance structure for the de-sparsified estimator, a feature that, to the best of our knowledge, has not been studied in the existing literature. The superior performance of the method is demonstrated through Monte Carlo simulations. Finally, we present an extensive application of our methodology to predict the financial distress of Chinese-listed firms and to identify covariates that are statistically significant for predicting distress. Our novel procedure is implemented in the R package Survivalml.
💡 Research Summary
This paper tackles the notoriously difficult problem of forecasting corporate distress when the data exhibit three intertwined complications: right‑censoring, a high‑dimensional set of predictors, and mixed‑frequency observations. The authors propose a unified methodological framework that simultaneously addresses all three issues. First, they handle censoring by employing outcome‑weighted inverse probability of censoring weighting (OIPCW), a recent tool from survival analysis that re‑weights each observation by the inverse of its probability of remaining uncensored up to the event time. Under the standard independent‑censoring and sufficient‑follow‑up assumptions, the weighted logistic loss depends only on observable quantities (the censored survival time, the censoring indicator, and a conditional survival function of the censoring time).
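The weighting idea can be illustrated with a minimal sketch. It assumes a Kaplan-Meier estimate of the censoring survival function G(u) = P(C > u) that does not depend on covariates, and a binary target "distress by a fixed horizon"; this is a common textbook form of inverse probability of censoring weighting, not necessarily the exact weights used in the paper. The function names `km_censoring_survival` and `ipcw_weights` are hypothetical, and the toy data have no tied times.

```python
def km_censoring_survival(times, events):
    """Kaplan-Meier estimate of G(u) = P(C > u).

    Censoring (event == 0) is treated as the 'event' of interest here.
    Assumes no ties between event and censoring times, for simplicity.
    """
    steps = []  # (time, G(time)) pairs in increasing time order
    surv = 1.0
    for u in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= u)
        censored = sum(1 for y, d in zip(times, events) if y == u and d == 0)
        surv *= 1.0 - censored / at_risk
        steps.append((u, surv))

    def G(u, left_limit=False):
        # Step function; left_limit=True returns G(u-) instead of G(u).
        value = 1.0
        for s, g in steps:
            if s < u or (s == u and not left_limit):
                value = g
            else:
                break
        return value

    return G


def ipcw_weights(times, events, horizon):
    """Binary labels 1{T <= horizon} and their censoring weights."""
    G = km_censoring_survival(times, events)
    labels, weights = [], []
    for y, d in zip(times, events):
        if y <= horizon and d == 1:      # distress observed before the horizon
            labels.append(1)
            weights.append(1.0 / G(y, left_limit=True))
        elif y > horizon:                # known to survive past the horizon
            labels.append(0)
            weights.append(1.0 / G(horizon))
        else:                            # censored before the horizon: status unknown
            labels.append(0)             # placeholder label, ignored via zero weight
            weights.append(0.0)
    return labels, weights


# Toy data: observed time min(T, C) and event indicator (1 = distress, 0 = censored).
times = [1, 2, 3, 4, 5, 6]
events = [1, 0, 1, 0, 1, 1]
labels, weights = ipcw_weights(times, events, horizon=3.5)
print(labels)
print([round(w, 3) for w in weights])
```

Observations censored before the horizon get weight zero, and the remaining observations are up-weighted to compensate, so the weighted logistic loss uses only observable quantities.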
Second, to cope with the explosion of variables caused by many lags of each predictor, the authors embed the Mixed Data Sampling (MIDAS) approach. Each original covariate is represented by a low‑dimensional basis expansion (e.g., low‑order polynomials or B‑splines) that approximates the effect of all its lags. This creates a natural group structure: every group corresponds to one original variable together with all its lagged effects.
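As a rough illustration of the dimension reduction, the sketch below projects the lags of one covariate onto a few shifted Legendre polynomials. The choice of Legendre polynomials, the midpoint lag grid, and the function names `legendre_basis` and `midas_aggregate` are assumptions made for this example; the paper's own basis may differ.

```python
def legendre_basis(num_lags, degree):
    """Matrix with one row per lag and columns P_0..P_degree.

    Legendre polynomials are evaluated on a midpoint grid of [0, 1]
    mapped to [-1, 1], using Bonnet's recursion.
    """
    basis = []
    for j in range(num_lags):
        x = 2.0 * (j + 0.5) / num_lags - 1.0
        row = [1.0, x]
        for k in range(1, degree):
            row.append(((2 * k + 1) * x * row[k] - k * row[k - 1]) / (k + 1))
        basis.append(row[: degree + 1])
    return basis


def midas_aggregate(lags, basis):
    """Collapse one observation's lag vector into degree + 1 features."""
    m = len(lags)
    n_features = len(basis[0])
    return [sum(lags[j] * basis[j][k] for j in range(m)) / m
            for k in range(n_features)]


# Twelve monthly lags of one covariate become three regressors.
basis = legendre_basis(num_lags=12, degree=2)
monthly_lags = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0]
features = midas_aggregate(monthly_lags, basis)
print(len(monthly_lags), "lags ->", len(features), "features")
```

The coefficients on these three features form one group; the implied weight on each individual lag is the corresponding linear combination of basis values.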
Third, they impose a sparse‑group LASSO penalty on the weighted logistic loss. The penalty consists of an ℓ₁ term that encourages overall sparsity and a group‑wise ℓ₂ term that either retains or discards entire groups. This dual regularisation simultaneously performs variable selection across the many original covariates and lag‑selection within each group, something that a plain LASSO cannot achieve.
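A minimal sketch of the penalty and its proximal operator follows, assuming the common parameterisation λ(α‖β‖₁ + (1 − α) Σ_G ‖β_G‖₂) with unweighted, non-overlapping groups; some formulations scale each group norm by the square root of the group size, which is omitted here. The function names `sgl_penalty` and `sgl_prox` are hypothetical.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse-group LASSO penalty for non-overlapping index groups."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(math.sqrt(sum(beta[j] ** 2 for j in g)) for g in groups)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal operator: elementwise soft-threshold, then group shrinkage."""
    out = [0.0] * len(beta)
    for g in groups:
        # Step 1: soft-threshold each coefficient (within-group sparsity).
        soft = [math.copysign(max(abs(beta[j]) - step * lam * alpha, 0.0), beta[j])
                for j in g]
        # Step 2: shrink the whole group towards zero (group-level sparsity).
        norm = math.sqrt(sum(v * v for v in soft))
        scale = max(1.0 - step * lam * (1.0 - alpha) / norm, 0.0) if norm > 0 else 0.0
        for j, v in zip(g, soft):
            out[j] = scale * v
    return out


# Two covariates with two basis coefficients each.
beta = [3.0, -0.5, 0.2, 0.1]
groups = [[0, 1], [2, 3]]
penalty = sgl_penalty(beta, groups, lam=1.0, alpha=0.5)
shrunk = sgl_prox(beta, groups, lam=1.0, alpha=0.5)
print(penalty)
print(shrunk)
```

In this toy case the second group is removed entirely, while the first group survives with one of its two coefficients set to zero, which is the combined between-group and within-group selection described above.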
The theoretical contributions are twofold. The authors derive finite‑sample error bounds for the penalised estimator that explicitly incorporate (i) the censoring weights, (ii) the MIDAS approximation error, and (iii) heavy‑tailed covariates. By extending the quadratic‑margin condition to accommodate non‑sub‑Gaussian designs, they obtain a convergence rate of order √(s log p / N) where s is the true sparsity and p the total number of parameters.
For inference, they construct a de‑sparsified (or debiased) estimator using nodewise regressions à la van de Geer et al. (2014). Crucially, they show that the presence of censoring induces a non‑standard variance structure in the asymptotic distribution of the de‑sparsified estimator. They provide a consistent estimator of this variance, allowing valid confidence intervals and hypothesis tests for individual coefficients even in the high‑dimensional, censored, mixed‑frequency setting.
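For orientation, the generic de-sparsified construction of van de Geer et al. (2014) can be written as below. The notation is assumed for this sketch: \(\hat\beta\) is the penalised estimator, \(\mathcal{L}_N\) the weighted logistic loss, \(\hat\Theta\) a nodewise-regression approximation to the inverse Hessian, and \(\hat V\) a variance estimator. The censoring-adjusted form of \(\hat V\) derived in the paper is not reproduced here.

```latex
\[
  \hat b \;=\; \hat\beta \;-\; \hat\Theta\,\nabla \mathcal{L}_N(\hat\beta),
  \qquad
  \hat b_j \;\pm\; z_{1-\alpha/2}\,\sqrt{\hat V_{jj}/N}
\]
```

The first expression removes the shrinkage bias with a one-step correction; the second is the resulting Wald-type confidence interval for the j-th coefficient.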
Monte‑Carlo experiments confirm that the proposed method outperforms standard logistic regression, ordinary LASSO, and MIDAS‑only approaches across a range of censoring rates (10‑50 %) and dimensionalities (p = 500–2000). The de‑sparsified estimator attains nominal 95 % coverage while maintaining low bias.
The empirical application uses a newly assembled panel of Chinese listed manufacturing firms from 1985 to 2020. The authors collect roughly 30 financial and macro‑economic variables observed at monthly, quarterly, or semi‑annual frequency; once lags are included, these yield over 1,200 potential predictors. After applying OIPCW to adjust for right‑censoring (many firms are still alive at the end of the sample), they fit the sparse‑group MIDAS logistic model. Cross‑validated tuning selects modest penalty levels, and the resulting model achieves an AUC of 0.87, substantially higher than a plain logistic model (0.78) and a standard LASSO (0.81). The de‑sparsified inference identifies liquidity ratio, debt‑to‑asset ratio, sales growth, and a six‑month lag of the manufacturing PMI as statistically significant predictors of distress, providing both predictive power and economic insight.
Finally, the authors release an R package, Survivalml, that implements the full pipeline, from censoring‑weight estimation through MIDAS basis construction and penalised fitting to de‑sparsified inference, making the methodology readily accessible to practitioners. In sum, the paper delivers a theoretically rigorous, computationally feasible, and empirically validated solution to high‑dimensional censored survival analysis, with immediate relevance for credit risk modelling, regulatory stress testing, and any domain where mixed‑frequency, censored outcomes are encountered.