biniLasso: Automated cut-point detection via sparse cumulative binarization
We present biniLasso and its sparse variant (sparse biniLasso), novel methods for prognostic analysis of high-dimensional survival data that enable detection of multiple cut-points per feature. Our approach leverages the Cox proportional hazards model with two key innovations: (1) a cumulative binarization scheme with $L_1$-penalized coefficients operating on context-dependent cut-point candidates, and (2) for sparse biniLasso, additional uniLasso regularization to enforce sparsity while preserving univariate coefficient patterns. These innovations yield substantially improved interpretability, computational efficiency (4–11× faster than existing approaches), and prediction performance. Through extensive simulations, we demonstrate superior performance in cut-point detection, particularly in high-dimensional settings. Application to three genomic cancer datasets from TCGA confirms the methods' practical utility, with both variants showing enhanced risk prediction accuracy compared to conventional techniques.
💡 Research Summary
This paper introduces biniLasso and its sparse variant, miniLasso (also called sparse biniLasso), as new methods for prognostic analysis of high‑dimensional survival data that automatically detect multiple cut‑points per covariate. The authors build on the Cox proportional hazards model but replace the traditional one‑hot encoding of continuous predictors with a cumulative binarization scheme. For each continuous variable, a set of candidate cut‑points is defined; for each cut‑point a binary indicator is created that equals 1 when the observation’s value exceeds the cut‑point. This yields a nested (multi‑hot) representation where lower cut‑points are subsets of higher ones, allowing a natural “low vs. all higher” risk comparison that aligns with clinical interpretation of thresholds.
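The cumulative encoding described above can be sketched in a few lines of plain Python. The function name and example cut-points below are illustrative, not taken from the paper's implementation:

```python
def cumulative_binarize(values, cutpoints):
    """Encode each value as nested 'exceeds cut-point' indicators.

    Column j equals 1 when the value exceeds the j-th (sorted) cut-point,
    so the columns form the nested multi-hot representation described above.
    """
    cutpoints = sorted(cutpoints)
    return [[1 if v > c else 0 for c in cutpoints] for v in values]

# Hypothetical candidate cut-points for one continuous feature.
rows = cumulative_binarize([1.2, 3.4, 7.8], [2.0, 5.0])
# Nestedness: whenever the higher-threshold indicator fires, so does the lower.
assert all(not (hi and not lo) for lo, hi in rows)
```

Compared with one-hot interval coding, each coefficient here measures the incremental risk of crossing a threshold, which is what makes the "low vs. all higher" clinical reading possible.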
biniLasso applies a standard L1 (lasso) penalty directly to the coefficients of these cumulative binary features. Because the cumulative design matrix is essentially full rank, no additional total‑variation penalty or linear constraints (required in the earlier binacox method) are needed. Consequently, the optimization problem is a convex lasso, solvable with fast algorithms (coordinate descent, FISTA), leading to 4‑11× speed improvements over binacox while preserving a unique sparse solution.
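The convexity claim above is what makes proximal-gradient solvers applicable. As a minimal sketch of the ISTA-style iteration (FISTA adds momentum), the code below substitutes squared-error loss for the Cox partial likelihood; the soft-thresholding proximal step for the $L_1$ penalty is the same in both cases. All names are illustrative:

```python
def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista_lasso(X, y, lam, step=0.01, iters=2000):
    """Proximal gradient (ISTA) for lasso-penalized least squares.

    A squared-error stand-in for the penalized Cox partial likelihood that
    biniLasso optimizes; each iteration is a gradient step on the smooth
    loss followed by coordinate-wise soft-thresholding.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam) for j in range(p)]
    return beta
```

On a toy design where only the first column drives the response, the iteration drives the inactive coefficient to exactly zero while shrinking the active one, which is the sparsity behavior the method relies on.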
miniLasso augments biniLasso with the recently proposed uniLasso two‑stage regularization. First, univariate Cox models are fitted for each binary indicator, generating leave‑one‑out (LOO) predictions. Second, a non‑negative lasso Cox model is fitted using these LOO predictions as features. This procedure enforces sign consistency with the univariate effects, mitigates multicollinearity inherent in cumulative binarization, and yields a more parsimonious model without requiring feature standardization. The resulting optimization includes an L1 penalty with adaptive weights and a non‑negativity constraint on the coefficients.
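The two-stage structure can be sketched with linear models standing in for the univariate Cox fits (the real method fits Cox models and a Cox non-negative lasso; everything here, including function names, is a simplified illustration):

```python
def loo_univariate_predictions(x, y):
    """Stage 1 stand-in: leave-one-out fitted values from a no-intercept
    univariate regression. Excluding observation i from the slope before
    predicting it mirrors the LOO predictions uniLasso feeds to stage 2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return [xi * (sxy - xi * yi) / (sxx - xi * xi) for xi, yi in zip(x, y)]

def nonneg_lasso(F, y, lam, iters=500):
    """Stage 2 stand-in: coordinate descent for a non-negative lasso on the
    LOO-prediction features F. Clamping each update at zero enforces sign
    consistency with the univariate effects."""
    n, p = len(F), len(F[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            r = [y[i] - sum(F[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            num = sum(F[i][j] * r[i] for i in range(n)) / n - lam
            den = sum(F[i][j] ** 2 for i in range(n)) / n
            beta[j] = max(0.0, num / den)
    return beta
```

Because each stage-2 feature is already a univariate prediction on the outcome scale, a positive coefficient means "keep this indicator with its univariate sign," which is how the sign-consistency property arises.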
To accommodate clinical needs for a limited number of cut‑points per predictor, the authors propose a two‑step screening: (1) fit separate lasso‑penalized Cox models for each predictor’s binary features across a fine grid of λ values, (2) retain the top m most influential cut‑points (those entering earliest or with largest coefficients). The final multivariate model is then built using only these selected cut‑points, avoiding the combinatorial difficulty of directly constraining the number of non‑zero coefficients.
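A stripped-down version of that screening step can be sketched as follows. It uses the orthonormal-design simplification, under which a column enters the lasso path at $\lambda_j = |B_j^\top y|/n$, so "entering earliest" reduces to ranking columns by absolute univariate score; the paper's actual screening fits full per-predictor lasso paths, and the function below is only an assumed simplification:

```python
def screen_cutpoints(B, y, m):
    """Keep the m cut-point columns of B that would enter a lasso path first,
    using the orthonormal-design entry value lambda_j = |B_j . y| / n as the
    ranking score. Returns the retained column indices in ascending order."""
    n = len(B)
    scores = [abs(sum(B[i][j] * y[i] for i in range(n))) / n
              for j in range(len(B[0]))]
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(order[:m])
```

Ranking by path-entry order sidesteps the combinatorial problem of constraining the number of non-zero coefficients directly: the hard cap of m cut-points per predictor is imposed before the final multivariate fit rather than inside it.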
Extensive simulations covering a range of signal strengths, numbers of true cut‑points, and correlation structures demonstrate that both biniLasso and miniLasso outperform binacox, standard Cox‑lasso, and Elastic Net in cut‑point recovery, variable selection accuracy, and concordance index (C‑index). MiniLasso, in particular, achieves high sparsity while maintaining predictive performance, making it attractive for interpretability.
The methods are applied to three TCGA cancer genomics datasets (breast, lung, and gastric cancers) containing thousands of gene expression features and several hundred patients. Both variants achieve higher C‑indices (≈0.68–0.73) than competing approaches and select cut‑points that correspond to known biological thresholds. MiniLasso often selects only 5–7 cut‑points, demonstrating that a compact, clinically usable model can be derived without sacrificing accuracy.
The authors acknowledge limitations: cumulative binarization does not capture interactions between variables; the choice of candidate cut‑points influences computational load; and the non‑negative constraint may be restrictive if the true risk direction differs. Future work is suggested on extending the framework to model interactions, integrating Bayesian priors, and relaxing the sign constraints.
In summary, biniLasso and miniLasso provide a fast, interpretable, and statistically robust solution for automated cut‑point detection in high‑dimensional survival analysis, bridging the gap between data‑driven threshold discovery and practical clinical decision‑making.