Safe Offline Reinforcement Learning for Sepsis Treatment: A Two-Stage Framework Combining Constraint-Aware Learning with Runtime Safety Filtering

Bailing Zhang; Yuwei Mi

doi:10.53941/tai.2026.100007

Transactions on Artificial Intelligence, 2026, Vol. 2, Issue 1: 103-118. DOI: 10.53941/tai.2026.100007

Abstract

Reinforcement learning (RL) has shown promise in optimizing treatment strategies for sepsis, a life-threatening condition responsible for significant mortality in intensive care units. However, deploying RL policies in clinical settings requires not only optimizing patient outcomes but also ensuring adherence to established medical guidelines. In this paper, we propose a two-stage safety framework for offline RL-based sepsis treatment. The first stage employs Constraint-Penalized Q-learning combined with Implicit Q-learning (CPQ-IQL), which incorporates clinical constraints through Lagrangian optimization during policy learning. The second stage applies a runtime safety filter that dynamically validates actions against clinical guidelines before execution. We evaluate our framework on the ICU-Sepsis benchmark with four clinically motivated constraints derived from the Surviving Sepsis Campaign 2021 guidelines. Experimental results over five random seeds demonstrate that CPQ-IQL achieves the lowest constraint violation rate (22.88 ± 0.94%) among all baselines while maintaining competitive survival rates (78.4 ± 1.8%). When combined with the Safe Actions filtering mechanism, constraint violations are reduced by 97.2% (from 22.88% to 0.41%), demonstrating the effectiveness of our two-stage safety framework. Our analysis reveals that the Safe Actions filter modifies approximately 21% of policy decisions, highlighting the importance of runtime safety mechanisms for clinical deployment. These findings suggest that combining constraint-aware offline learning with runtime safety filtering provides a practical pathway toward safe and effective RL-based clinical decision support systems.
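
To make the two stages concrete, the following is a minimal sketch of a generic Lagrangian multiplier update and a runtime action filter. It is not the authors' CPQ-IQL implementation or their Safe Actions filter: the function names, the learning rate, the action layout (25 actions indexed as 5 * fluid_level + vaso_level) and the single toy rule are all assumptions made for illustration, and the rule is not one of the four constraints used in the paper.

```python
from typing import Callable, Sequence, Tuple


def lagrange_step(lam: float, avg_cost: float, budget: float, lr: float = 0.1) -> float:
    """Dual ascent on the multiplier: grow it while the average constraint
    cost exceeds the budget, shrink it otherwise, never below zero."""
    return max(0.0, lam + lr * (avg_cost - budget))


def penalized_value(q_reward: float, q_cost: float, lam: float) -> float:
    """Lagrangian score of one action: reward value minus weighted cost value."""
    return q_reward - lam * q_cost


def filter_action(
    state: int,
    scores: Sequence[float],
    is_allowed: Callable[[int, int], bool],
) -> Tuple[int, bool]:
    """Return (action, was_modified).

    Keeps the policy's top-scoring action when the rule set allows it;
    otherwise substitutes the best-scoring allowed action.
    """
    ranked = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
    preferred = ranked[0]
    for action in ranked:
        if is_allowed(state, action):
            return action, action != preferred
    return preferred, False  # nothing allowed: leave the policy's choice untouched


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical rule, NOT one of the paper's constraints: in states flagged
# as hypotensive, forbid actions whose vasopressor level is zero.
HYPOTENSIVE_STATES = {7}  # made-up state index


def toy_rule(state: int, action: int) -> bool:
    vaso_level = action % 5  # assumed layout: action = 5 * fluid_level + vaso_level
    return not (state in HYPOTENSIVE_STATES and vaso_level == 0)


if __name__ == "__main__":
    scores = [0.0] * 25
    scores[10] = 1.0  # fluid level 2, vasopressor level 0: the policy's favourite
    scores[11] = 0.8  # fluid level 2, vasopressor level 1
    print(filter_action(7, scores, toy_rule))  # (11, True): filter overrides
    print(filter_action(3, scores, toy_rule))  # (10, False): choice passes through
    print(round(relative_reduction(22.88, 0.41), 1))  # 98.2
```

Counting how often `was_modified` is true over an evaluation run gives the kind of "fraction of decisions modified" statistic quoted above. Applied to the rounded rates in the abstract (22.88% and 0.41%), `relative_reduction` gives roughly 98.2% rather than the stated 97.2%; the stated figure may rest on unrounded or per-seed values.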

Keywords

offline reinforcement learning / safe reinforcement learning / sepsis treatment / clinical decision support / constrained optimization

Cite this article

Bailing Zhang, Yuwei Mi. Safe Offline Reinforcement Learning for Sepsis Treatment: A Two-Stage Framework Combining Constraint-Aware Learning with Runtime Safety Filtering. Transactions on Artificial Intelligence, 2026, 2(1): 103-118. DOI: 10.53941/tai.2026.100007

Author Contributions

B.Z.: conceptualization, methodology, software, formal analysis, writing (original draft preparation), writing (reviewing and editing), visualization, supervision. Y.M.: validation, investigation, clinical constraint formulation, writing (reviewing and editing). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study used the publicly available ICU-Sepsis benchmark, which is derived from the MIMIC-III database. MIMIC-III is a de-identified dataset with pre-existing ethical approval from the Beth Israel Deaconess Medical Center (Boston, MA, USA). No additional IRB approval was required as this research did not involve direct interaction with human subjects or collection of new patient data.

Informed Consent Statement

Not applicable. This study used a publicly available de-identified dataset and did not involve direct interaction with human subjects.

Data Availability Statement

The ICU-Sepsis benchmark used in this study is publicly available at https://github.com/icu-sepsis/icu-sepsis. The MIMIC-III database, from which the benchmark is derived, is available at https://physionet.org/content/mimiciii/ upon completion of the required training and data use agreement. The code for reproducing the experiments will be made available upon publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Use of AI and AI-Assisted Technologies

During the preparation of this work, the authors used Claude (Anthropic) to assist with manuscript editing and proofreading. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
