Safe Offline Reinforcement Learning for Sepsis Treatment: A Two-Stage Framework Combining Constraint-Aware Learning with Runtime Safety Filtering
Bailing Zhang, Yuwei Mi
Transactions on Artificial Intelligence, 2026, Vol. 2, Issue 1: 103-118.
Reinforcement learning (RL) has shown promise in optimizing treatment strategies for sepsis, a life-threatening condition responsible for significant mortality in intensive care units. However, deploying RL policies in clinical settings requires not only optimizing patient outcomes but also ensuring adherence to established medical guidelines. In this paper, we propose a two-stage safety framework for offline RL-based sepsis treatment. The first stage employs Constraint-Penalized Q-learning combined with Implicit Q-Learning (CPQ-IQL), which incorporates clinical constraints through Lagrangian optimization during policy learning. The second stage applies a runtime safety filter, Safe Actions, that validates each proposed action against clinical guidelines before execution. We evaluate the framework on the ICU-Sepsis benchmark with four clinically motivated constraints derived from the Surviving Sepsis Campaign 2021 guidelines. Results over five random seeds show that CPQ-IQL achieves the lowest constraint violation rate (22.88 ± 0.94%) among all baselines while maintaining a competitive survival rate (78.4 ± 1.8%). When CPQ-IQL is combined with the Safe Actions filter, constraint violations are reduced by 97.2% (from 22.88% to 0.41%), demonstrating the effectiveness of the two-stage framework. Our analysis shows that the Safe Actions filter modifies approximately 21% of policy decisions, which highlights the importance of runtime safety mechanisms for clinical deployment. These findings suggest that combining constraint-aware offline learning with runtime safety filtering provides a practical pathway toward safe and effective RL-based clinical decision support systems.
offline reinforcement learning / safe reinforcement learning / sepsis treatment / clinical decision support / constrained optimization
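
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dual_ascent_step` and `filter_action`, the boolean `allowed` mask, the `fallback_action` argument, and the three-action toy example are all assumptions made for illustration. The first function shows a generic projected dual-ascent update of a Lagrange multiplier, the usual way a Lagrangian method tightens a penalty while a constraint is violated. The second shows one plausible runtime filter over a discrete action set: keep the policy's greedy action if it is allowed, otherwise substitute the highest-valued allowed action.

```python
from typing import Sequence, Tuple


def dual_ascent_step(multiplier: float, avg_cost: float,
                     cost_limit: float, step_size: float = 0.01) -> float:
    """Generic projected dual-ascent step (illustrative, stage one).

    The multiplier grows while the average constraint cost exceeds the
    limit and shrinks otherwise; it is clipped so it never goes negative.
    """
    return max(0.0, multiplier + step_size * (avg_cost - cost_limit))


def filter_action(q_values: Sequence[float], allowed: Sequence[bool],
                  fallback_action: int) -> Tuple[int, bool]:
    """Hypothetical runtime safety filter (illustrative, stage two).

    Returns (action, was_modified). `allowed[a]` is assumed to come from
    a guideline check on the current patient state; how that check is
    built is outside this sketch.
    """
    actions = range(len(q_values))
    proposed = max(actions, key=lambda a: q_values[a])  # greedy action
    if allowed[proposed]:
        return proposed, False
    safe = [a for a in actions if allowed[a]]
    if not safe:
        # No action passes the check: defer to a caller-supplied default.
        return fallback_action, True
    return max(safe, key=lambda a: q_values[a]), True


if __name__ == "__main__":
    # Toy example: the greedy action (index 1) is disallowed.
    print(filter_action([0.2, 0.9, 0.5], [True, False, True], 0))
```

Returning the `was_modified` flag alongside the action makes it straightforward to measure how often the filter overrides the policy, the kind of statistic the abstract reports as roughly 21% of decisions.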