Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13–14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
💡 Research Summary
This paper investigates whether safety guarantees obtained during training of Safe Reinforcement Learning (Safe RL) algorithms persist when the policies are deployed under distribution shift, using diabetes management as a safety‑critical testbed. The authors first build a unified clinical simulator that models both Type 1 and Type 2 diabetes, supporting insulin‑pump and non‑pump therapies. Patient variability is introduced through latent physiological parameters (insulin sensitivity, absorption rates, etc.) and partial adherence, creating a clear gap between training and test populations across three age groups.
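The train/test gap described above can be sketched in code. The following is a minimal, hypothetical illustration of how a simulator might induce distribution shift by sampling disjoint patient populations over latent physiological parameters; all parameter names, ranges, and cohort labels are illustrative assumptions, not taken from GlucoSim itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patient(cohort):
    """Sample hypothetical latent physiological parameters for one
    simulated patient. Ranges and names are illustrative only."""
    sens_range = {
        "adolescent": (0.6, 1.0),
        "adult":      (0.8, 1.2),
        "elderly":    (0.7, 1.1),
    }[cohort]
    return {
        "insulin_sensitivity": rng.uniform(*sens_range),
        "carb_absorption_rate": rng.uniform(0.02, 0.05),  # 1/min, assumed
        "adherence": rng.uniform(0.7, 1.0),  # fraction of doses taken
    }

# Disjoint cohorts induce the physiological distribution shift:
# training sees one population, evaluation another.
train_patients = [sample_patient("adult") for _ in range(20)]
test_patients = [sample_patient("adolescent") for _ in range(10)]
```

A policy whose cost constraint is satisfied in expectation over `train_patients` has no guarantee on `test_patients`, which is the gap the paper measures.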
Eight representative Safe RL algorithms (including PPO‑Lag, Constrained Policy Optimization (CPO), and TRPO‑Lagrangian) are trained on a set of simulated patients under a constrained Markov Decision Process (CMDP) formulation in which a cost limit enforces safety (e.g., hypoglycemia avoidance, limited intervention frequency). In the training cohort, all algorithms satisfy the cost constraint and achieve reasonable clinical performance (Time‑in‑Range ≈ 55‑60%). However, when evaluated on unseen patients, a “safety generalization gap” emerges: cost violations rise dramatically (to 30‑45% of episodes), especially for younger patients and Type 2 cohorts, indicating that training‑time constraint satisfaction does not guarantee safety under physiological distribution shift.
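The CMDP structure above can be made concrete with a small sketch. The reward shaping and cost terms below are illustrative assumptions (only the clinical 70–180 mg/dL Time‑in‑Range band and the hypoglycemia/intervention cost signals are from the summary); the exact functional forms and the budget value are hypothetical.

```python
def reward(glucose_mgdl):
    """Hypothetical shaped reward: +1 inside the clinical
    Time-in-Range band (70-180 mg/dL), penalized outside."""
    if 70.0 <= glucose_mgdl <= 180.0:
        return 1.0
    return -abs(glucose_mgdl - 125.0) / 100.0

def cost(glucose_mgdl, intervened):
    """Safety cost counted against the CMDP budget: hypoglycemia
    dominates; each intervention also incurs a small charge."""
    c = 1.0 if glucose_mgdl < 70.0 else 0.0
    if intervened:
        c += 0.1  # limits intervention frequency via the budget
    return c

# CMDP constraint: E[sum_t cost_t] <= d for a fixed budget d.
# Lagrangian methods (PPO-Lag, TRPO-Lag) enforce this by adding
# -lambda * cost_t to the optimized return and adapting lambda.
```

The key point is that the expectation in the constraint is taken over the *training* patient distribution, which is exactly what fails to transfer under shift.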
To address this gap, the authors propose a test‑time predictive shielding mechanism. Central to the shield is a personalized dynamics model called Basis‑Adaptive Neural ODE (BA‑NODE). BA‑NODE combines an Inverted Transformer that encodes multivariate glucose, insulin, and meal histories; a latent Neural ODE ensemble that provides multiple candidate dynamical modes; and a function‑space adaptation layer that linearly combines these basis trajectories using patient‑specific weights derived from a regularized least‑squares fit on recent context windows. This architecture yields accurate continuous‑time glucose forecasts even under large inter‑patient variability, achieving roughly 12% lower RMSE than standard LSTM/GRU baselines.
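The function‑space adaptation step is a standard ridge regression over the basis trajectories, and can be sketched as follows. This is a minimal illustration of the fitting math only; the function names, the matrix shapes, and the regularization value are assumptions, and the real BA‑NODE basis predictions come from the Neural ODE ensemble rather than being passed in directly.

```python
import numpy as np

def fit_basis_weights(basis_ctx, y_ctx, lam=1.0):
    """Regularized least-squares fit of patient-specific weights.
    basis_ctx: (T, K) predictions of the K basis models on the recent
    context window; y_ctx: (T,) observed glucose values. Solves
    min_w ||basis_ctx @ w - y_ctx||^2 + lam * ||w||^2 in closed form."""
    K = basis_ctx.shape[1]
    A = basis_ctx.T @ basis_ctx + lam * np.eye(K)
    b = basis_ctx.T @ y_ctx
    return np.linalg.solve(A, b)

def adapted_forecast(basis_future, w):
    """Combine future basis trajectories with the fitted weights."""
    return basis_future @ w
```

Because the weights are refit on each patient's recent window, the same frozen ensemble can specialize to unseen physiology without retraining, which is what makes the shield algorithm‑agnostic.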
During deployment, the shield intercepts the policy’s action distribution, selects the top‑k candidate actions (k≈3), and queries BA‑NODE to predict the glucose trajectory for each candidate over a clinically relevant horizon. Actions whose predicted trajectories breach predefined safety bands (e.g., glucose < 70 mg/dL or > 180 mg/dL) are masked; safe alternatives or a “no‑action” fallback are applied, together with immediate clinical overrides such as rescue carbohydrates for predicted hypoglycemia.
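The deployment-time filtering loop described above can be sketched as below. The 70/180 mg/dL safety band and the top‑k candidate screening are from the summary; the function signature, the abstract `predict_traj` callable standing in for BA‑NODE, and the use of `0.0` as the "no‑action" fallback are illustrative assumptions (the paper's fallback also includes clinical overrides such as rescue carbohydrates, omitted here).

```python
import numpy as np

def shield(action_probs, actions, predict_traj, k=3,
           lo=70.0, hi=180.0):
    """Test-time predictive shield (sketch). predict_traj(a) is
    assumed to return the dynamics model's glucose forecast (mg/dL)
    over the safety horizon for candidate action a."""
    top_k = np.argsort(action_probs)[::-1][:k]  # k most likely actions
    for idx in top_k:
        traj = predict_traj(actions[idx])
        if traj.min() >= lo and traj.max() <= hi:
            return actions[idx]  # first candidate predicted safe
    return 0.0  # fallback: withhold insulin ("no-action")
```

For example, with a predictor that forecasts hypoglycemia for the policy's most likely dose but an in‑range trajectory for the runner‑up, the shield passes through the runner‑up instead; if every candidate is predicted unsafe, it returns the fallback.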
Extensive experiments across 72 settings (8 algorithms × 3 diabetes types × 3 age groups) show that shielding consistently reduces cost violations by over 80%, improves Time‑in‑Range by 13‑14 percentage points, and lowers both the clinical risk index and glucose variability. The most notable improvements are observed for PPO‑Lag and CPO, where Time‑in‑Range rises from ~55% to ~68‑70% while reward is maintained or improved.
The study delivers two key insights. First, safety constraints enforced only during training are insufficient for real‑world medical control, where patient dynamics shift in ways unseen during training; explicit evaluation of safety generalization is essential. Second, model‑based test‑time safety verification—implemented here as predictive shielding with a personalized dynamics model—offers a practical, algorithm‑agnostic remedy that does not require retraining the underlying policy.
Future work suggested includes reducing the computational overhead of real‑time shielding, extending the framework to multi‑objective clinical settings (e.g., simultaneous blood pressure control), and validating the approach in actual clinical trials. The released simulator (GlucoSim) and algorithm benchmark (GlucoAlg) provide a valuable platform for the community to explore safety under distribution shift in other safety‑critical control domains.