Fast variable selection for distributional regression with application to continuous glucose monitoring data
With the growing prevalence of diabetes and the associated public health burden, it is crucial to identify modifiable factors that could improve patients’ glycemic control. In this work, we seek to examine associations between medication usage, concurrent comorbidities, and glycemic control, utilizing data from continuous glucose monitors (CGMs). CGMs provide interstitial glucose measurements, but reducing the data to simple statistical summaries is common in clinical studies, resulting in substantial information loss. Recent advancements in the Fréchet regression framework make it possible to use more of this information by treating the full distributional representation of CGM data as the response, while sparsity regularization enables variable selection. However, the methodology does not scale to large datasets. Crucially, variable selection inference using subsampling methods is computationally infeasible. We develop a new algorithm for sparse distributional regression by deriving a new explicit characterization of the gradient and Hessian of the underlying objective function, while also utilizing rotations on the sphere to perform feasible updates. The updated method is up to 10000-fold faster than the original approach, opening the door for applying sparse distributional regression to large-scale datasets and enabling previously unattainable subsampling-based inference. Applying our method to CGM data from patients with type 2 diabetes and obstructive sleep apnea, we found a significant association between sulfonylurea medication and glucose variability without evidence of association with glucose mean. We also found that overnight oxygen desaturation variability showed a stronger association with glucose regulation than overall oxygen desaturation levels.
💡 Research Summary
The paper addresses a pressing need in diabetes research: how to exploit the rich information contained in continuous glucose monitoring (CGM) data without collapsing it into a few summary statistics. The authors build on the Fréchet regression framework, which treats each subject’s empirical quantile function (a representation of the full glucose distribution) as the response in a metric space equipped with the 2‑Wasserstein distance. By adding an ℓ₁‑type sparsity penalty, the method can select a subset of covariates associated with the entire distribution rather than just its mean or variance.
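To make the response object concrete, the following minimal sketch evaluates an empirical quantile function on a fixed probability grid and computes the 2‑Wasserstein distance between two one-dimensional distributions, which for distributions on the real line equals the L2 distance between their quantile functions. The function names, the midpoint grid, and the simulated readings are illustrative assumptions, not the authors' code or data.

```python
import numpy as np

def empirical_quantile_function(readings, n_grid=100):
    """Evaluate the empirical quantile function on n_grid evenly spaced probabilities."""
    # Midpoints of n_grid equal-width bins on (0, 1); the grid choice is an assumption.
    probs = (np.arange(n_grid) + 0.5) / n_grid
    return np.quantile(np.asarray(readings, dtype=float), probs)

def wasserstein2(q1, q2):
    """Approximate 2-Wasserstein distance from quantile functions on the same uniform grid."""
    # In one dimension, W2^2 is the integral over (0, 1) of the squared
    # difference of quantile functions; the mean approximates that integral.
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

# Simulated stand-ins for two glucose records (not real CGM data).
rng = np.random.default_rng(0)
record_a = rng.normal(150.0, 20.0, size=2800)
record_b = record_a + 10.0  # same shape, shifted upward by 10 units

qa = empirical_quantile_function(record_a)
qb = empirical_quantile_function(record_b)
distance = wasserstein2(qa, qb)
print(distance)
```

Because the second record is a pure shift of the first, every quantile moves by the same amount and the distance equals the size of the shift. A change in spread with an unchanged mean would also give a nonzero distance, which is the kind of effect a mean-only summary cannot register.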
The original sparse Fréchet regression algorithm (Tucker et al., 2023) updates one component of the weight vector λ at a time, a procedure that becomes computationally prohibitive as the number of covariates grows. Moreover, because each update ignores gradient information, convergence is slow, and the resulting run time makes it impractical to combine the method with subsampling‑based inference such as stability selection. To overcome these limitations, the authors derive closed‑form expressions for the gradient and Hessian of the objective function by exploiting the optimality conditions of the embedded constrained problem. They then replace the simplex constraint on λ with a spherical constraint on a reparameterized vector γ (‖γ‖₂ = τ) and perform updates via rotations on the sphere. This is essentially a geodesic gradient descent that moves the whole weight vector in a feasible direction at each iteration. The explicit Hessian further enables second‑order steps, which accelerate convergence substantially.
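The idea of a feasible update by rotation can be illustrated with a generic geodesic gradient step on a sphere: project the gradient onto the tangent space, then rotate the current point along the great circle in the descent direction, so the norm constraint holds exactly after every step. This is a sketch under stated assumptions rather than the authors' algorithm. The toy quadratic objective, the fixed step size, and the name `geodesic_step` are all hypothetical, and no Hessian information is used.

```python
import numpy as np

def geodesic_step(gamma, grad, step):
    """One gradient step along a great circle; the norm of gamma is preserved."""
    r = np.linalg.norm(gamma)
    # Remove the radial component so the direction is tangent to the sphere.
    tangent = grad - (grad @ gamma) / r**2 * gamma
    t_norm = np.linalg.norm(tangent)
    if t_norm < 1e-15:
        return gamma.copy()  # already stationary on the sphere
    direction = -tangent / t_norm      # unit descent direction, orthogonal to gamma
    theta = step * t_norm / r          # rotation angle for an arc of length step * t_norm
    return np.cos(theta) * gamma + r * np.sin(theta) * direction

# Toy objective (an assumption for illustration): f(gamma) = gamma' A gamma.
# On the unit sphere its minimum value is the smallest eigenvalue of A.
A = np.diag([3.0, 2.0, 1.0])

def objective(g):
    return float(g @ A @ g)

gamma = np.ones(3) / np.sqrt(3.0)
for _ in range(500):
    gamma = geodesic_step(gamma, 2.0 * A @ gamma, step=0.1)

print(np.linalg.norm(gamma), objective(gamma))
```

All coordinates change at every iteration, in contrast to a one-coordinate-at-a-time scheme, and no projection back onto the constraint set is needed because the rotation never leaves the sphere.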
Empirically, the new algorithm reduces computation time from roughly 1.5 hours (for n = 207, p = 34, 20 tuning parameters) to under a second, a speed‑up the authors report as up to 10 000‑fold. This gain makes it feasible to run the model thousands of times on subsamples of the data, thereby enabling stability selection to assess the reliability of selected variables and to limit the number of false selections.
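The subsampling loop behind stability selection can be sketched as follows: refit a variable selector on many random subsamples and record how often each covariate is chosen. In this minimal sketch the selector is a deliberately simple stand-in (top-k absolute correlation with a scalar response), not sparse Fréchet regression. The function names, the half-sample fraction, and the simulated data are assumptions for illustration only.

```python
import numpy as np

def stability_selection(X, y, select, n_subsamples=100, frac=0.5, seed=0):
    """Return per-covariate selection frequencies over random subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(np.floor(frac * n))
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)  # subsample without replacement
        counts += select(X[idx], y[idx])            # boolean mask of selected covariates
    return counts / n_subsamples

def top_k_correlation_selector(k):
    """Hypothetical stand-in selector: keep the k covariates most correlated with y."""
    def select(X, y):
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
        score = np.abs(Xc.T @ yc) / denom
        chosen = np.zeros(X.shape[1], dtype=bool)
        chosen[np.argsort(score)[-k:]] = True
        return chosen
    return select

# Simulated data: only covariates 0 and 3 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

freq = stability_selection(X, y, top_k_correlation_selector(2))
print(np.round(freq, 2))
```

Covariates whose selection frequency exceeds a chosen threshold are reported as stable. Since the selector is refit once per subsample (and, in practice, once per tuning parameter), the cost of a single fit determines whether the whole procedure is practical.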
The methodology is applied to the HYPNOS trial, a randomized study of patients with type 2 diabetes and obstructive sleep apnea (OSA). Each participant’s CGM record (≈2 800 glucose readings over ~10 days) is transformed into a 100‑point empirical quantile function. Covariates include demographics, BMI, HbA1c, medication classes (biguanides, sulfonylureas, etc.), and OSA severity metrics (ODI4, mean saturation, TST90%). After filtering out medications used by fewer than five patients, 34 covariates remain.
Results reveal two clinically relevant findings. First, sulfonylurea use shows no evidence of association with the mean glucose level but is significantly associated with glucose variability (as captured by the spread of the quantile function). This suggests that sulfonylureas may increase the risk of hypoglycemic excursions without altering average glycemia. Second, nighttime oxygen desaturation variability (measured by TST90%) is more strongly linked to glucose regulation than overall oxygen desaturation levels, highlighting the importance of nocturnal hypoxemia fluctuations in metabolic control.
From a statistical perspective, the paper contributes: (1) an explicit gradient/Hessian derivation for Fréchet regression with distributional responses, (2) a novel spherical‑constraint optimization that leverages geodesic updates for massive speed gains, and (3) the first demonstration that stability selection can be performed within this framework, providing rigorous variable‑selection inference. The authors discuss extensions to multivariate distributional responses, non‑linear basis expansions, and GPU‑accelerated implementations for biobank‑scale datasets.
In summary, the work transforms sparse distributional regression from a theoretically appealing but computationally limited tool into a practical, scalable method for high‑dimensional biomedical data. By preserving the full distributional information of CGM signals and enabling reliable variable selection, it opens new avenues for uncovering nuanced physiological relationships that are invisible to traditional mean‑centric analyses.