Enhanced reasoning and task planning for surgical autonomy using multi-modal large language models with gradual learning

Sadra Zargarzadeh, Jemima Okanlawon, Maryam Mirzaei, Mahan Mohammadi, Mahdi Tavakoli

Biomimetic Intelligence and Robotics ›› 2026, Vol. 6 ›› Issue (1): 100277 · DOI: 10.1016/j.birob.2026.100277
Research Article

Abstract

Large language models (LLMs) have been widely adopted in robotic applications in recent years, but their ability to plan long-horizon, complex tasks remains a challenge. In this work, we present a gradual learning method to address this challenge and explore its usability in surgical training tasks that demand high levels of reasoning, such as peg transfer and the sliding puzzle task. Experiments were conducted both in a simulation environment and on the da Vinci Research Kit (dVRK), with environment feedback initiating follow-up prompts to the LLM when necessary. Results showed that for complex tasks, the gradual learning method outperformed the direct approach in the LLM’s task and motion planning, achieving higher success rates and faster execution with fewer follow-up prompts. This suggests that for complex pseudo-surgical tasks, it is more efficient to have the LLM first solve simpler versions of the task and incrementally increase complexity, rather than tackle the full complex task at once. The approach shows promise for enhancing robot-assisted surgery, where tasks are complex, long-horizon, and demand strong reasoning abilities.
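To make the method concrete, below is a minimal Python sketch of a gradual-learning prompt loop of the kind the abstract describes. It is illustrative only: the llm callable, the env execute-and-feedback interface, and the peg-transfer curriculum are hypothetical stand-ins, not the paper's implementation.

# Minimal, illustrative sketch of a gradual-learning prompt loop.
# The `llm` callable, `env` feedback interface, and task curriculum
# below are hypothetical placeholders, not the paper's code.

from dataclasses import dataclass, field

@dataclass
class GradualPlanner:
    """Queries an LLM on progressively harder task variants, carrying
    earlier solved (task, plan) pairs forward as in-context examples."""
    solved: list = field(default_factory=list)

    def solve(self, task, llm, env, max_followups=3):
        # The prompt includes every simpler task solved so far,
        # which is the "gradual learning" step.
        context = "\n\n".join(f"Task: {t}\nPlan: {p}" for t, p in self.solved)
        prompt = f"{context}\n\nTask: {task}\nPlan:".lstrip()
        plan = llm(prompt)
        # Environment feedback initiates follow-up prompts on failure.
        for _ in range(max_followups + 1):
            success, feedback = env(plan)
            if success:
                self.solved.append((task, plan))  # reused for harder tasks
                return plan
            plan = llm(f"{prompt} {plan}\nFeedback: {feedback}\nRevised plan:")
        return None

# Toy stand-ins so the sketch runs end to end.
def llm(prompt):
    return "grasp peg; lift; move over target; release"

def env(plan):
    return ("release" in plan, "gripper missed the peg")

planner = GradualPlanner()
# Curriculum: start simple, incrementally increase complexity.
for task in ["transfer 1 peg", "transfer 3 pegs", "transfer 6 pegs"]:
    result = planner.solve(task, llm, env)
    print(task, "->", "success" if result else "failed")

The point of the design is that solutions to simpler task variants seed the prompt for harder ones, and environment feedback triggers revision prompts, so the planner never attempts the full complex task cold.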

Keywords

Large language models / Reasoning / Task planning / Surgical robotics / Embodied learning for robotics / Zero-shot learning system for robotics

Cite this article

Sadra Zargarzadeh, Jemima Okanlawon, Maryam Mirzaei, Mahan Mohammadi, Mahdi Tavakoli. Enhanced reasoning and task planning for surgical autonomy using multi-modal large language models with gradual learning. Biomimetic Intelligence and Robotics, 2026, 6(1): 100277. DOI: 10.1016/j.birob.2026.100277


CRediT authorship contribution statement

Sadra Zargarzadeh: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Project administration, Methodology, Investigation, Data curation, Conceptualization. Jemima Okanlawon: Writing – review & editing, Validation, Methodology. Maryam Mirzaei: Writing – review & editing, Writing – original draft, Supervision, Methodology, Investigation, Conceptualization. Mahan Mohammadi: Writing – review & editing, Validation, Methodology. Mahdi Tavakoli: Writing – review & editing, Supervision, Project administration, Investigation, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Canada Foundation for Innovation (CFI), the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canadian Institutes of Health Research (CIHR), and Alberta Innovates, Canada.
