Introduction
Advances in surgery have revolutionized the management of both acute and chronic diseases, prolonging life and extending the boundary of patient survival. These advances are underpinned by continuing technological developments in diagnosis, imaging, and surgical instrumentation. Complex surgical navigation and planning are made possible through the use of both pre- and intraoperative imaging techniques, such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) [
1]. Surgical trauma is reduced through minimally invasive surgery (MIS), which is now progressively combined with robotic assistance [
2]. Postoperative care is also improved by sophisticated wearable and implantable sensors for supporting early discharge after surgery, thereby enhancing patient recovery and early detection of postsurgical complications [
3,
4]. Numerous terminal illnesses have been transformed into clinically manageable chronic lifelong conditions, and surgery is increasingly focused on the systemic effects of the procedure on patients rather than on isolated surgical treatment or anatomical alteration alone, with careful consideration of the metabolic, hemodynamic, and neurohormonal consequences that can influence quality of life.
In medicine, artificial intelligence (AI) has played an important role in supporting clinical decision-making since the early development of the MYCIN system [
5]. AI is now increasingly used for risk stratification, genomics, imaging and diagnosis, precision medicine, and drug discovery. AI was introduced into surgery more recently, with strong roots in imaging and navigation; early techniques focused on feature detection and computer-assisted intervention for both preoperative planning and intraoperative guidance. Over the years, supervised algorithms, such as active-shape models, atlas-based methods, and statistical classifiers, have been developed [
1]. The recent successes of deep convolutional neural networks (DCNNs), such as AlexNet [
6], have enabled automatically learned data-driven descriptors to be used for image understanding, which have shown improved robustness and generalizability compared with ad hoc hand-crafted features.
As robotics is increasingly applied in surgery, AI is set to transform the field through the development of sophisticated functions connecting real-time sensing to robotic control. Varying levels of autonomy can allow the surgeon and robotic system to navigate together constantly changing, patient-specific environments that might otherwise limit the ability of either alone to complete a surgical task effectively. Additionally, by leveraging the parallel medical advances in early detection and targeted therapy, AI can help ensure that the proper intervention is executed. Future surgical robots are expected to perceive and understand complicated surroundings, conduct real-time decision-making, and perform desired tasks with increased precision, safety, and efficiency. But what are the roles of AI in these systems and in the future of surgery in general? How can these systems deal with dynamic environments and learn from human operators? How can reliable control policies be derived to achieve human–machine symbiosis?
In this article, we review the applications of AI in preoperative planning, intraoperative guidance, and its integrated use in surgical robotics. Popular AI techniques, including an overview of their requirements, challenges, and sub-areas in surgery, are summarized in Fig. 1, which outlines the main flow of the paper. We first introduce the application of AI in preoperative planning. We then discuss several AI techniques for intraoperative guidance and review the applications of AI in surgical robotics. Finally, we provide our conclusions and future outlook. Throughout the review, we place a strong emphasis on deep learning-based approaches.
AI for preoperative planning
Preoperative planning, in which surgeons plan the surgical procedure on the basis of existing medical records and imaging, is essential for the success of surgery. X-ray, CT, ultrasound, and MRI are the most common imaging modalities used in clinical practice. Routine tasks based on medical imaging include anatomical classification, detection, segmentation, and registration.
Classification
Classification takes as input a single medical image or volume, or a set of them, covering organs or lesions, and outputs a diagnostic value. Aside from traditional machine-learning and image-analysis techniques, deep learning-based methods are growing in popularity [
7]. The network architecture of deep learning-based methods for classification is composed of convolutional layers for extracting information from the input and fully connected layers for regressing the diagnostic value.
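For illustration, the following is a minimal PyTorch sketch of this generic structure, convolutional layers for feature extraction followed by fully connected layers that output a diagnostic score. It is not any of the cited architectures; the input size, channel counts, and two-class output are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class LesionClassifier(nn.Module):
    """Minimal CNN: convolutional feature extractor + fully connected classifier head."""
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 128 -> 64
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # global pooling -> 64 features
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, num_classes),          # diagnostic logits (e.g., benign vs. malignant)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of four single-channel 128x128 image patches (synthetic placeholder data).
model = LesionClassifier()
logits = model(torch.randn(4, 1, 128, 128))
print(logits.shape)  # torch.Size([4, 2])
```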
For example, a classification pipeline using Google’s Inception and ResNet architectures was proposed to classify lung, bladder, and breast cancers [
8]. Chilamkurthy
et al. demonstrated that deep learning can recognize intracranial hemorrhage, calvarial fracture, midline shift, and mass effect from head CT scans [
9]. The mortality, renal failure, and postoperative bleeding of patients after cardiosurgical care can be predicted in real time by a recurrent neural network (RNN) with improved accuracy compared with standard-of-care clinical tools [
10]. ResNet-50 and Darknet-19 were used to classify benign and malignant lesions in ultrasound images, showing similar sensitivity and improved specificity [
11].
Detection
Detection provides the spatial localization of regions of interest, often in the form of bounding boxes or landmarks, and may also include image- or region-level classification. Similarly, deep learning-based approaches have shown promise in detecting various anomalies or medical conditions. DCNNs for detection usually consist of convolutional layers for feature extraction and regression layers to determine the bounding box properties.
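As a schematic illustration of this structure (not one of the cited detectors; the grid-based output, layer sizes, and five-value prediction per cell are assumptions), the sketch below regresses an objectness score and bounding-box parameters from shared convolutional features:

```python
import torch
import torch.nn as nn

class SimpleDetector(nn.Module):
    """Shared convolutional backbone + per-cell regression of objectness and box offsets."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # For every cell of the downsampled feature map, predict
        # (objectness, dx, dy, dw, dh): 5 values.
        self.head = nn.Conv2d(32, 5, kernel_size=1)

    def forward(self, x):
        fmap = self.backbone(x)          # (B, 32, H/4, W/4)
        out = self.head(fmap)            # (B, 5, H/4, W/4)
        objectness = torch.sigmoid(out[:, :1])
        box_params = out[:, 1:]
        return objectness, box_params

model = SimpleDetector()
obj, boxes = model(torch.randn(2, 1, 64, 64))
print(obj.shape, boxes.shape)  # (2, 1, 16, 16) and (2, 4, 16, 16)
```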
A deeply stacked convolutional autoencoder was trained to extract statistical and kinetic biological features for detecting prostate cancer from 4D positron-emission tomography images [
12]. A 3D CNN with roto-translation group convolutions was proposed for pulmonary nodule detection with good accuracy, sensitivity, and convergence speed [
13].
Deep reinforcement learning (DRL) based on an extension of the deep Q-network was used to learn a search policy from dynamic contrast-enhanced MRI for detecting breast lesions [
14]. To detect acute intracranial hemorrhage from CT scans and improve network interpretability, Lee
et al. [
15] used an attention map and an iterative process to mimic the workflow of radiologists.
Segmentation
Segmentation can be treated as a pixel- or voxel-level image classification problem. Owing to the limited computational resources available in early works, each image or volume was divided into small windows, and CNNs were trained to predict the target label at the central location of the window. Image- or voxel-wise segmentation was then achieved by running the CNN classifier over densely sampled image windows. For example, DeepMedic exhibited good performance for multimodal brain tumor segmentation from MRI [
16]. However, the sliding window-based approach is inefficient because network activations are repeatedly computed over heavily overlapping windows. For this reason, it has largely been replaced by fully convolutional networks (FCNs) [
17]. The key idea of FCNs is to replace fully connected layers in a classification network with convolutional layers and upsampling layers, a process that substantially improves segmentation efficiency. Encoder–decoder networks, such as U-Net [
18,
19], have shown promising performance in medical-image segmentation. The encoder has multiple convolutional and downsampling layers that extract image features at different scales. The decoder has convolutional and upsampling layers that recover the spatial resolution of feature maps and finally achieve pixel- or voxel-wise dense segmentation. A review of different normalization methods for training U-Net for medical-image segmentation is provided by Zhou and Yang [
20].
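The encoder–decoder idea can be sketched with the following minimal two-scale network in the spirit of U-Net (greatly simplified relative to the cited architectures; the channel widths, depth, and two-class output are assumptions):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """Two-scale encoder-decoder with a skip connection and pixel-wise output."""
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_channels, 16)
        self.down = nn.MaxPool2d(2)
        self.enc2 = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)      # 16 upsampled + 16 skip channels
        self.out = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                   # full-resolution features
        e2 = self.enc2(self.down(e1))       # half-resolution features
        d1 = self.up(e2)                    # recover spatial resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection
        return self.out(d1)                 # per-pixel class logits

model = TinyUNet()
print(model(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```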
For navigation during endoscopic pancreatic and biliary procedures, Gibson
et al. [
21] used dilated convolutions and fused image features at multiple scales to segment abdominal organs from CT scans. For interactive segmentation of the placenta and fetal brain from MRI, an FCN was combined with user-provided bounding boxes and scribbles, with the last few layers of the FCN fine-tuned according to the user input [
22]. The segmentation and localization of surgical instrument landmarks were modeled as heatmap regression, and an FCN was used to track the instruments in near real time [
23]. For pulmonary nodule segmentation, Feng
et al. addressed the issue of requiring accurate manual annotations when training FCNs by learning discriminative regions from weakly labeled lung CT with a candidate screening method [
24]. Bai
et al. proposed a self-supervised learning strategy to improve the cardiac segmentation accuracy of U-Net with limited labeled training data [
25].
Registration
Registration is the spatial alignment between two medical images, volumes, or modalities. It is particularly important for both pre- and intraoperative planning. Traditional algorithms usually iteratively calculate a parametric transformation, e.g., an elastic, fluid, or B-spline model, to minimize a given metric, e.g., mean squared error, normalized cross-correlation, or mutual information, between the two inputs. Recently, deep regression models have been used to replace these traditional, time-consuming, optimization-based registration algorithms.
An example of deep learning-based approaches to registration is VoxelMorph, which maximizes standard image-matching objective functions by leveraging a CNN-based structure and auxiliary segmentation to map an input image pair to a deformation field [
26]. An end-to-end deep learning framework was proposed for 3D medical-image registration that consisted of three stages, namely, affine transform prediction, momentum calculation, and non-parametric refinement, to combine affine registration and vector momentum-parameterized stationary velocity field [
27]. A weakly supervised framework was proposed for multimodal image registration, trained on images with higher-level correspondences, i.e., anatomical labels, rather than voxel-level transformations, to predict the displacement field [
28]. A Markov decision process with each agent trained with dilated FCN was applied to align a 3D volume to 2D X-ray images [
29]. RegNet was proposed by considering multiscale contexts and trained on an artificially generated displacement vector field to achieve a nonrigid registration [
30]. A 3D image registration can also be formulated as a strategy-learning process with 3D raw images as the input, the next optimal action (i.e., up or down) as the output, and the CNN as the agent [
31].
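In the spirit of these deformation-field regressors (and not a reimplementation of any cited method), the sketch below maps a 2D moving/fixed image pair to a dense displacement field, warps the moving image with a differentiable sampler, and trains on an image-similarity loss; the image size, network depth, and regularization weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegNet2D(nn.Module):
    """CNN that regresses a dense 2D displacement field from a (moving, fixed) image pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),   # (dx, dy) per pixel
        )

    def forward(self, moving, fixed):
        return self.net(torch.cat([moving, fixed], dim=1))

def warp(image, flow):
    """Warp `image` with per-pixel displacements `flow` (in pixels) via grid_sample."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack([2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(image, grid, align_corners=True)

model = RegNet2D()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
moving, fixed = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)  # placeholder image pair
for _ in range(10):
    flow = model(moving, fixed)
    warped = warp(moving, flow)
    # similarity term plus a simple penalty on displacement magnitude
    loss = F.mse_loss(warped, fixed) + 0.1 * flow.pow(2).mean()
    optim.zero_grad(); loss.backward(); optim.step()
```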
AI for intraoperative guidance
Computer-aided intraoperative guidance has always been a cornerstone of MIS. Learning strategies have been extensively integrated into the development of intraoperative guidance to provide enhanced visualization and localization in surgery. Recent works can be divided into four main aspects: shape instantiation, endoscopic navigation, tissue tracking, and augmented reality (AR) (Fig. 2).
3D shape instantiation
For intraoperative 3D reconstruction, 3D volumes can be scanned with MRI, CT, or ultrasound. In practice, however, such 3D/4D acquisition can be time-consuming or produce scans with low resolution. Limiting the number of images needed for 3D shape reconstruction enables a 3D surgical scene to be reconstructed in real time, and well-designed protocols can additionally improve the resolution of the reconstruction. Intraoperative real-time 3D shape instantiation from a single 2D image, or a limited number of 2D images, is therefore an emerging area of research.
For example, a 3D prostate shape was instantiated from multiple nonparallel 2D ultrasound images with a radial basis function [
32]. The 3D shapes of fully compressed, fully deployed, and also partially deployed stent grafts were instantiated from a single projection of 2D fluoroscopy with mathematical modeling combined with the robust perspective-n-point method, graft gap interpolation, and graph neural networks [
33–
35]. Furthermore, an equally weighted focal U-Net was proposed to automatically segment the markers on stent grafts and improve the efficiency of the intraoperative stent graft shape-instantiation framework [
36]. Moreover, a 3D abdominal aortic aneurysm (AAA) skeleton was instantiated from a single projection of 2D fluoroscopy with skeleton deformation and graph matching [
37]. A 3D liver shape was instantiated from a single 2D projection via principal component analysis (PCA), statistical shape model (SSM), and partial least square regression (PLSR) [
38]; this work was further generalized to a registration-free shape instantiation framework for any dynamic organ with sparse PCA, SSM, and kernel PLSR [
39]. Recently, an advanced deep, one-stage learning strategy that estimates a 3D point cloud from a single 2D image was proposed for 3D shape instantiation [
40].
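To illustrate the statistical flavor of the PCA/PLSR pipeline in [38,39] (a toy sketch with entirely synthetic data and arbitrary dimensions, not the published method), one can compress the 3D shapes with PCA, learn a regression from 2D projection features to the PCA coefficients, and then instantiate a new 3D shape from a single 2D observation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)

# Synthetic training data: 50 paired samples of a flattened 3D mesh (300 coordinates)
# and a 2D projection descriptor (60 values); real data would come from imaging.
shapes_3d = rng.normal(size=(50, 300))
proj_2d = shapes_3d[:, :60] + 0.05 * rng.normal(size=(50, 60))  # correlated 2D features

# 1) Compress the 3D shapes with a statistical shape model (here plain PCA).
pca = PCA(n_components=10).fit(shapes_3d)
shape_coeffs = pca.transform(shapes_3d)

# 2) Learn the mapping from 2D projection features to 3D shape coefficients.
plsr = PLSRegression(n_components=5).fit(proj_2d, shape_coeffs)

# 3) At "intraoperative" time: instantiate the 3D shape from a single 2D observation.
new_proj = proj_2d[0:1]
pred_coeffs = plsr.predict(new_proj)
pred_shape = pca.inverse_transform(pred_coeffs)   # back to full 3D coordinates
print(pred_shape.shape)  # (1, 300)
```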
Endoscopic navigation
Surgery is increasingly trending toward intraluminal procedures and endoscopic surgery, driven by early detection and intervention. Navigation techniques have been developed to guide the maneuvering of endoscopes toward target locations. To this end, learning-based depth estimation, visual odometry, and simultaneous localization and mapping (SLAM) have been tailored for camera localization and environment mapping with the use of endoscopic images.
Depth estimation
Depth estimation from endoscopic images plays an essential role in 6-DoF camera motion estimation and 3D structural environment mapping; it has been tackled by either supervised [
41,
42] or by self-supervised [
43,
44] deep learning methods. This process is hindered by two main challenges. First, obtaining a large amount of high-quality training data containing paired video images and depth maps is practically difficult because of both hardware constraints and labor-intensive labeling. Second, surgical scenes are often textureless, which makes it difficult to apply depth recovery methods that rely on feature matching and reconstruction [
45,
46].
To address the challenge of limited training data, Ye
et al. [
47] proposed a self-supervised depth estimation approach for stereoimages using siamese networks. For monocular depth recovery, Mahmood
et al. [
41,
42] learned the mapping from rendered RGB images to the corresponding depth maps with synthetic data and adopted domain transfer learning to convert real RGB images into rendered images. Additionally, self-supervised unpaired image-to-image translation [
44] using a modified cycle generative adversarial network (CycleGAN) [
48] was proposed to recover the depth from bronchoscopic images. Moreover, a self-supervised CNN based on the principle of shape from motion was applied to recover the depth and achieve visual odometry for an endoscopic capsule robot [
43].
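The self-supervised principle for a rectified stereo pair, predict a disparity map, synthesize one view from the other, and penalize the photometric difference, can be sketched as follows. This is a highly simplified illustration with placeholder data and an arbitrary network, not the cited siamese or capsule-robot architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispNet(nn.Module):
    """Tiny CNN predicting a non-negative horizontal disparity map from the left image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),   # disparity >= 0
        )

    def forward(self, left):
        return self.net(left)

def reconstruct_left(right, disparity):
    """Sample the right image at x - d to synthesize the left view (rectified pair)."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    x_src = xs.unsqueeze(0) - disparity[:, 0]                 # horizontal shift per pixel
    y_src = ys.unsqueeze(0).expand_as(x_src)
    grid = torch.stack([2 * x_src / (w - 1) - 1, 2 * y_src / (h - 1) - 1], dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

model = DispNet()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
left, right = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)  # placeholder stereo frames
for _ in range(5):
    disp = model(left)
    recon = reconstruct_left(right, disp)
    loss = (recon - left).abs().mean()       # photometric loss: no depth labels needed
    optim.zero_grad(); loss.backward(); optim.step()
```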
Visual odometry
Visual odometry uses consecutive video frames to estimate the pose of a moving camera. CNN-based approaches [
49] were adopted for camera pose estimation on the basis of temporal information. Turan
et al. [
49] estimated the camera pose for an endoscopic capsule robot using a CNN for feature extraction and long short-term memory (LSTM) for dynamics estimation. Sganga
et al. [
50] combined ResNet and FCN to calculate the pose change between consecutive video frames. However, the feasibility of localization approaches based on visual odometry has only been validated on lung phantom data [
50] and gastrointestinal (GI) tract data [
49].
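A schematic sketch of this CNN + LSTM pattern (arbitrary layer sizes and random placeholder frames, not the cited networks) is shown below: a small CNN extracts per-frame features and a recurrent layer regresses a 6-DoF relative pose for each time step.

```python
import torch
import torch.nn as nn

class VisualOdometryNet(nn.Module):
    """Per-frame CNN features fed to an LSTM that regresses 6-DoF pose increments."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> 32-dim feature per frame
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.pose_head = nn.Linear(64, 6)                   # (tx, ty, tz, roll, pitch, yaw)

    def forward(self, frames):                              # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        hidden, _ = self.lstm(feats)                        # temporal modeling
        return self.pose_head(hidden)                       # (B, T, 6) relative poses

model = VisualOdometryNet()
video = torch.randn(2, 8, 3, 64, 64)     # two clips of eight endoscopic frames (placeholder)
print(model(video).shape)                 # torch.Size([2, 8, 6])
```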
3D reconstruction and localization
Owing to the dynamic nature of tissues, real-time 3D reconstruction of the tissue environment and localization within it are vital prerequisites for navigation. SLAM is a widely studied technique in robotics, in which the robot simultaneously builds a 3D map of the surrounding environment and localizes the camera pose within that map. Traditional SLAM algorithms assume a rigid environment, in contrast to a typical surgical scene, where soft tissues and organs deform. This assumption limits their direct adoption for surgical tasks. To address this limitation, Mountney
et al. [
51] first applied the extended Kalman filter SLAM (EKF-SLAM) framework [
52] with a stereoendoscope, where the SLAM estimation was compensated with periodic motion of soft tissues caused by respiration [
53]. Grasa
et al. [
54] evaluated the effectiveness of monocular EKF-SLAM in hernia repair surgery for measuring hernia defects. Turan
et al. [
55] estimated the depth images from the RGB data through shape from shading. They then adopted the RGB-D SLAM framework by using paired RGB and depth images. Song
et al. [
56] implemented a dense deformable SLAM on a graphics processing unit (GPU) and an ORB-SLAM on a central processing unit (CPU) to boost the localization and mapping performance of a stereoendoscope.
Endovascular interventions have been increasingly utilized to treat cardiovascular diseases. However, visual cameras are not applicable inside vessels; instead, catheter mapping is commonly used for navigation in radiofrequency catheter ablation [
57]. To this end, recent advances in intravascular ultrasound (IVUS) have offered another avenue for endovascular intraoperative guidance. Shi
et al. [
58] first proposed the simultaneous catheter and environment modeling (SCEM) framework for 3D vasculature reconstruction by fusing electromagnetic (EM) sensing data and IVUS images. To deal with the errors and uncertainty in the measurements from both the EM sensors and the IVUS images, they improved SCEM to reconstruct the 3D environment by solving a nonlinear optimization problem [
59]. To alleviate the burden of preregistration between preoperative CT data and EM sensing data, a registration-free SCEM approach was proposed for more efficient data fusion [
60].
Tissue feature tracking
Learning strategies have also been applied to soft tissue tracking in MIS. Mountney and Yang [
61] introduced an online learning framework that updates the feature tracker over time by selecting correct features using decision tree classification. Ye
et al. [
62] proposed a detection approach that incorporates a structured support vector machine (SVM) and online random forest for re-targeting a preselected optical biopsy region on soft tissue surfaces of the GI tract. Wang
et al. [
63] adopted a statistical appearance model to differentiate the organ from the background in their region-based 3D tracking algorithm. Their validation results demonstrated that incorporating learning strategies can improve the robustness of tissue tracking with respect to the deformations and variations in illumination.
Augmented reality
AR improves surgeons’ intraoperative vision by providing a semitransparent overlay of the preoperative image on the area of interest [
64]. Wang
et al. [
65] used a projector to project the AR overlay for oral and maxillofacial surgery, with 3D contour matching used to calculate the transformation between the virtual image and the real teeth. Instead of using projectors, Pratt
et al. exploited HoloLens, a head-mounted AR device, to project a 3D vascular model onto the lower limb of patients [
66]. Given that one of the most challenging tasks is to project the overlay on markerless deformable organs, Zhang
et al. [
67] introduced an automatic registration framework for AR navigation, in which the iterative closest point and RANSAC algorithms were applied for 3D deformable tissue reconstruction.
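As a simple illustration of the rigid core of such point-cloud registration (a toy point-to-point iterative closest point on synthetic data, without the RANSAC outlier rejection or deformation handling used in the cited framework), the sketch below alternates nearest-neighbor correspondence search with a closed-form rigid alignment:

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rotation R and translation t mapping src onto dst."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def icp(source, target, iterations=20):
    """Basic iterative closest point: match, align, repeat."""
    tree = cKDTree(target)
    current = source.copy()
    for _ in range(iterations):
        _, idx = tree.query(current)            # nearest-neighbor correspondences
        R, t = best_rigid_transform(current, target[idx])
        current = current @ R.T + t             # apply the incremental transform
    return current

# Synthetic example: a slightly rotated and translated copy of a random surface patch.
rng = np.random.default_rng(1)
target = rng.normal(size=(200, 3))
angle = np.deg2rad(5)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0],
               [np.sin(angle),  np.cos(angle), 0],
               [0, 0, 1]])
source = (target - 0.05) @ Rz.T
err_before = np.linalg.norm(source - target, axis=1).mean()
aligned = icp(source, target)
err_after = np.linalg.norm(aligned - target, axis=1).mean()
print(err_before, err_after)   # the alignment error should drop substantially
```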
AI for surgical robotics
Owing to the development of AI techniques, surgical robots can achieve superhuman performance during MIS [
68,
69]. The objective of AI is to boost the capability of surgical robotic systems in perceiving complex
in vivo environments, conducting decision-making, and performing the desired tasks with increased precision, safety, and efficiency. As illustrated in Fig. 3, the common AI techniques used for robotic and autonomous systems (RAS) can be summarized in four aspects: (1) perception, (2) localization and mapping, (3) system modeling and control, and (4) human–robot interaction.
As overlap exists between intraoperative guidance and robot localization and mapping, this section mainly covers the methods for increasing the level of surgical autonomy.
Perception
Instrument segmentation and tracking
Instrument segmentation tasks can be divided into three groups: segmentation to distinguish the instrument from the background, multiclass segmentation of instrument parts (i.e., shaft, wrist, and gripper), and multiclass segmentation for different instruments. The advancement of deep learning in segmentation has remarkably improved the instrument segmentation accuracy from the exploitation of SVM for pixel-level binary classification [
70] to recent DCNN architectures, such as U-Net, TernausNet-VGG11, TernausNet-VGG16, and LinkNet, for both binary and multiclass segmentation [
71]. To further improve instrument segmentation performance, Islam
et al. developed a cascaded CNN with a multiresolution feature fusion framework [
72].
Algorithms for solving tracking problems can be separated into two categories: tracking by detection and tracking via local optimization [
73]. Previous works in this field mainly relied on hand-crafted features, such as Haar wavelets [
73], color or texture features [
74], and gradient-based features [
75]. More recent deep learning-based methods have been built on the concept of tracking by detection [
76,
77]. Various CNN architectures, such as AlexNet [
76] and ResNet [
23,
77], were used to detect the surgical tools from RGB images. Sarikaya
et al. [
78] additionally fed the optical flow estimated from color images into the network. LSTM was integrated to leverage spatiotemporal information and smooth the detection results [
77]. In addition to position tracking, the pose of the articulated end-effector was simultaneously estimated by the methods proposed by Ye
et al. [
75] and Kurmann et al. [
79].
Interaction between surgical tools and environment
A representative example of tool–tissue interaction during surgery is suturing. In this task, the robot needs to recover the 2D or 3D shape of thread from 2D images in real time. Other challenges in this task include the deformation of thread and variations in the environment. Padoy and Hager [
80] introduced a Markov random field-based optimization method to track the 3D thread modeled by a nonuniform rational B-spline.
Recently, a supervised two-branch CNN, called deep multistage detection (DMSD), was proposed for surgical thread detection [
81]. In addition, the DMSD framework was improved with a CycleGAN [
48] structure to perform domain adaptation for the foreground and background [
82]. On the basis of adversarial learning, additional synthetic data for thread detection were generated while preserving the semantic information, thereby enabling the transfer of learned knowledge to the target domain.
Estimation of the interaction forces between surgical instruments and tissues can provide meaningful feedback to ensure safe robotic manipulation.
Recent works have incorporated AI techniques into vision-based force sensing, which estimates the applied force from visual inputs. An LSTM-RNN architecture can learn an accurate mapping between visual–geometric information and the applied force in a supervised manner [
83]. In addition to supervised learning, a semisupervised DCNN was proposed by Marban
et al. [
84], where a convolutional auto-encoder learns a representation from RGB images and an LSTM then minimizes the error between the estimated force and the ground truth.
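The supervised sequence-regression idea can be sketched as follows, using a toy recurrent model with made-up feature dimensions and synthetic targets rather than the cited architectures: a sequence of visual–geometric feature vectors (e.g., tool-tip displacement and local deformation descriptors) is mapped to the applied force at each time step.

```python
import torch
import torch.nn as nn

class ForceEstimator(nn.Module):
    """LSTM regressor from visual-geometric feature sequences to applied force."""
    def __init__(self, feature_dim=12):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, 32, batch_first=True)
        self.head = nn.Linear(32, 1)          # scalar force magnitude per time step

    def forward(self, x):                      # x: (B, T, feature_dim)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)        # (B, T)

model = ForceEstimator()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic training pair: feature sequences and force labels (placeholders for real data).
features = torch.randn(8, 50, 12)
forces = torch.rand(8, 50)
for _ in range(20):
    pred = model(features)
    loss = loss_fn(pred, forces)
    optim.zero_grad(); loss.backward(); optim.step()
```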
System modeling and control
Learning from human demonstrations
Learning from demonstration (LfD), also known as programming by demonstration, imitation learning, or apprenticeship learning, is a popular paradigm for enabling robots to autonomously perform new tasks with learned policies. This paradigm is beneficial for complicated automation tasks, such as surgical procedures, for which surgical robots can autonomously execute specific motions or tasks by learning from surgeons’ demonstrations without tedious programming. Such robots can reduce surgeons’ tedium as well as provide superhuman performance in terms of execution speed and smoothness. The common LfD framework first segments a complicated surgical task into several motion primitives or subtasks and then recognizes, models, and executes these subtasks sequentially.
Surgical task segmentation and recognition
The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [
85] is the first publicly available benchmark data set for surgical activity segmentation and recognition. It contains synchronized video and kinematic data of three subtasks captured from the da Vinci robot: suturing, needle passing, and knot tying. Unsupervised clustering algorithms are the most popular for surgical task segmentation. Fard
et al. [
86] proposed a soft boundary-modified Gath–Geva clustering algorithm for segmenting kinematic data. A transition state clustering (TSC) method [
87] was presented to exploit both the video and kinematic data to detect and cluster transitions between linear dynamic regimes on the basis of kinematic, sensory, and temporal similarities. The TSC method was further improved by applying DCNNs to extract features from video data [
88]. For surgical subtask recognition, most previous methods [
85,
89,
90] were developed on the basis of variants of the hidden Markov model (HMM), conditional random field (CRF), and linear dynamic systems (LDS). In particular, joint segmentation and recognition frameworks were proposed by Despinoy
et al. [
91] and DiPietro
et al. [
92]. DiPietro
et al. [
92] specifically modeled the complex and nonlinear dynamics of kinematic data with RNNs to recognize both surgical gestures and activities, comparing simple RNNs, forward LSTMs, bidirectional LSTMs, gated recurrent units, and mixed-history RNNs with traditional methods. Liu and Jiang [
93] introduced a novel method by modeling the recognition task as a sequential decision-making process and trained an agent by RL with hierarchical features from a DCNN model.
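As a minimal illustration of recognizing surgical gestures from kinematic sequences (a generic bidirectional LSTM classifier with made-up input and label dimensions, not a reimplementation of the cited models), per-frame kinematic vectors are classified into gesture labels:

```python
import torch
import torch.nn as nn

class GestureRecognizer(nn.Module):
    """Bidirectional LSTM that assigns a gesture label to every kinematic time step."""
    def __init__(self, kin_dim=19, num_gestures=10):
        super().__init__()
        self.lstm = nn.LSTM(kin_dim, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, num_gestures)

    def forward(self, kinematics):             # (B, T, kin_dim)
        h, _ = self.lstm(kinematics)
        return self.head(h)                    # (B, T, num_gestures) frame-wise logits

model = GestureRecognizer()
kinematics = torch.randn(4, 200, 19)           # e.g., tool positions, orientations, grip angles
logits = model(kinematics)
labels = logits.argmax(dim=-1)                 # predicted gesture index per frame
print(labels.shape)                            # torch.Size([4, 200])
```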
Surgical task modeling, generation, and execution
After acquiring the segmented motion trajectories representing surgical subtasks, the dynamic time warping algorithm can be applied to temporally align different demonstrations before modeling. To autonomously generate the motion in a new task, previous works have extensively studied the Gaussian mixture model (GMM) [
94,
95], Gaussian process regression (GPR) [
96], dynamics model [
97], finite state machine [
98], and RNN [
99] for modeling the demonstrated trajectories. Experts’ demonstrations are encoded by a GMM, whose parameters are iteratively estimated by the expectation-maximization algorithm. Given the fitted GMM, Gaussian mixture regression (GMR) is then used to generate the target trajectory of the desired surgical task [
94,
95]. GPR is a nonlinear Bayesian function learning technique that models a sequence of observations generated by a Gaussian process. Osa
et al. [
96] chose GPR for online path planning in a dynamic environment. Given the predicted motion trajectory, different control strategies, e.g., linear–quadratic regulator controller [
97], sliding mode control [
96], and neural network [
100], can be applied to improve robustness in surgical task execution.
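A toy sketch of the GMM/GMR step on synthetic one-dimensional trajectories is given below (using scikit-learn's GaussianMixture; the conditioning step is the standard bivariate Gaussian conditional, and all data, component counts, and dimensions are artificial assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import norm

rng = np.random.default_rng(0)

# Several noisy demonstrations of a 1D trajectory x(t) (synthetic stand-in for aligned demos).
t = np.tile(np.linspace(0, 1, 100), 5)
x = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=t.size)
data = np.column_stack([t, x])                       # joint (time, position) samples

gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0).fit(data)

def gmr(t_query):
    """Gaussian mixture regression: E[x | t] under the fitted joint GMM."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    # Responsibility of each component for the query time (marginal over t).
    resp = np.array([w * norm.pdf(t_query, m[0], np.sqrt(c[0, 0]))
                     for w, m, c in zip(weights, means, covs)])
    resp /= resp.sum()
    # Conditional mean of x given t for each component, then mix.
    cond = np.array([m[1] + c[1, 0] / c[0, 0] * (t_query - m[0])
                     for m, c in zip(means, covs)])
    return float(resp @ cond)

# Generate a smooth reference trajectory for the robot controller to track.
trajectory = [gmr(tq) for tq in np.linspace(0, 1, 50)]
print(trajectory[:3])
```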
Reinforcement learning
In many surgical tasks, reinforcement learning (RL) is another popular machine-learning paradigm for solving problems, such as the control of continuum robots, soft tissue manipulation, and tube insertion, that are difficult to model analytically or to observe explicitly [
101]. In the learning process, the controller of the autonomous surgical robot, known as an agent, attempts to find a policy that yields a high accumulated reward through iterative interaction with the surrounding environment, which is modeled as a Markov decision process. To efficiently reduce the learning time, the RL algorithm can be initialized with policies learned from expert demonstrations instead of learning from scratch [
95,
102,
103]. Tan
et al. [
103] trained a generative adversarial imitation learning [
104] agent to imitate latent patterns existing in human demonstrations; this agent can deal with mismatched distributions caused by multimodal behaviors. DRL with advanced policy search methods has allowed robots to autonomously execute a wide range of tasks [
105]. However, repeating such trials on a surgical robotic platform over a million times is unrealistic. To this end, the agent can first be trained in a simulation environment and then transferred to a real robot [
106]; for example, tensioning policies can first be learned from a finite-element simulator via DRL and then transferred to a real physical system. However, the discrepancy between simulation data and the real-world environment still needs to be reconciled.
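To make the agent–environment loop concrete, below is a minimal tabular Q-learning sketch on a toy one-dimensional "insertion" task. The states, actions, and rewards are entirely synthetic and chosen only to illustrate the interaction; real surgical RL problems involve far richer state/action spaces and typically rely on DRL with policy search.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, GOAL = 10, 9                   # toy 1D task: move from state 0 to the goal state
ACTIONS = [-1, +1]                       # retract or advance
Q = np.zeros((N_STATES, len(ACTIONS)))   # action-value table
alpha, gamma, eps = 0.1, 0.95, 0.2       # learning rate, discount, exploration rate

def step(state, action_idx):
    """Environment: deterministic transition with a reward only at the goal."""
    next_state = int(np.clip(state + ACTIONS[action_idx], 0, N_STATES - 1))
    reward = 1.0 if next_state == GOAL else -0.01     # small cost per move
    return next_state, reward, next_state == GOAL

for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(Q[state].argmax())
        next_state, reward, done = step(state, a)
        # Q-learning update: bootstrap from the best next action
        Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
        state = next_state

print(Q.argmax(axis=1))   # learned greedy policy: expect mostly "advance" (index 1)
```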
Human–robot interaction
Human–robot interaction (HRI) is a field that integrates knowledge and techniques from multiple disciplines to build effective communication between humans and robots. With the help of AI, surgical task-oriented HRI allows surgeons to cooperatively control surgical robotic systems with touchless manipulation. The interaction media between surgeons and intelligent robots usually include the surgeon’s gaze, head movement, speech/voice, and hand gestures. By understanding the intention of humans, robots can then perform the most appropriate actions to satisfy the surgeons’ needs.
Tracking the 2D/3D eye-gaze points of surgeons can assist surgical instrument control and navigation [
107]. For surgical robots, the eye-gaze contingent paradigm can facilitate the transmission of images and enhance procedure performance, thereby enabling more accurate navigation of the instruments [
107]. Yang
et al. [
108] first introduced the concept of gaze-contingent perceptual docking for robot-assisted MIS, in which the robot can learn the operators’ specific motor and perceptual behavior through their saccadic eye movements and ocular vergence. Inspired by this idea, Visentini-Scarzanella
et al. [
109] used the gaze-contingent docking to reconstruct the surgeon’s area of interest with a Bayesian chains method in real time. Fujii
et al. [
110] performed gaze gesture recognition with an HMM to pan, zoom, and tilt the laparoscope during surgery. In addition to gaze, surgeons’ head movements can also be used to remotely control a laparoscope or endoscope [
111,
112].
Robots also have the potential to interpret human intentions through voice commands. However, voice-based robot assistance during surgery remains challenging because of the noisy environment in the operating room. With the development of deep learning in speech recognition, the precision and accuracy of speech recognition have considerably improved [
113], thus leading to more reliable control of the surgical robot [
114].
Hand gesture is another popular medium in different HRI scenarios. Learning-based real-time hand gesture detection and recognition methods have been developed by harnessing different sensors. Jacob
et al. [
115] designed a robotic scrub nurse, called Gestonurse, to understand nonverbal hand gestures. A Kinect sensor was used to localize and recognize the different gestures made by surgeons so that the requested surgical instruments could be delivered to them. Wen
et al. introduced an HMM-based hand gesture recognition method for AR control [
116]. More recently, vision-based hand gesture recognition with high precision [
117] has been achieved with the help of deep learning, which can substantially improve HRI safety in surgery.
Conclusion and future outlook
The advancement of AI has been transforming modern surgery toward more precise and autonomous intervention for treating both acute and chronic diseases. By leveraging such techniques, notable progress has been achieved in preoperative planning, intraoperative guidance, and surgical robotics. Herein, we summarize the major challenges in these three aspects (Fig. 4). We then discuss achievable visions of future research directions. Finally, we examine other key issues, such as ethics, regulation, and privacy.
Preoperative planning
Deep learning has been widely adopted in preoperative planning for tasks ranging from anatomical classification, detection, and segmentation to image registration. The results suggest that deep learning-based methods can outperform conventional approaches. However, data-driven approaches often suffer from inherent limitations: they tend to be less generalizable, less explainable, and more data-demanding.
To overcome these issues, close collaborations between multidisciplinary teams should be encouraged, particularly between surgeons and AI researchers, to generate large-scale annotated data that will provide more training data for AI algorithms. An alternative solution is to develop AI techniques, such as meta-learning, or learning to learn, that enable generalizable systems to perform diagnosis with limited data sets yet improved explainability.
Although many state-of-the-art machine-learning and deep-learning algorithms have made breakthroughs in the field of general computer vision, the differences between medical and natural images can be massive and thus may impede their clinical applicability. In addition, the underlying models and the derived results may not be easily interpretable by humans, a condition that raises important issues, such as potential risks and uncertainty in surgery. Potential solutions to these problems would be to explore different transfer learning techniques to mitigate the differences between image modalities and develop more explainable AI algorithms to enhance its decision-making performance.
Furthermore, utilizing personalized multimodal patient information, including omics-data and lifestyle information, in AI development can be useful in early detection and diagnosis, thereby leading to personalized treatment. These improvements also allow early treatment options that result in minimal trauma, low surgical risks, and short recovery time.
Intraoperative guidance
AI techniques have already contributed to more accurate and robust intraoperative guidance for MIS. 3D shape instantiation, camera pose estimation, and dynamic environment tracking and reconstruction have been tackled to assist various surgical interventions.
The key points in developing computer-assisted guidance from visual observations should be improving the localization and mapping performance with textureless surfaces, variation in illumination, and limited field of view.
Another major challenge is organ/tissue deformation, which complicates surgery with a dynamic and uncertain environment despite extensive preoperative planning. Although AI technologies have succeeded in detection, segmentation, tracking, and classification, further studies are warranted to extend these capabilities to more sophisticated 3D applications. Additionally, during surgery, an important requirement for an AI algorithm is its ability to assist surgeons efficiently in real time. Such demands arise in developing AR or virtual reality, where frequent interactions are required either between surgeons and autonomous guidance systems or during remote surgery involving multidisciplinary teams in different geographical locations.
Aside from visual information, future AI technologies must fuse multimodal data from various sensors to achieve a more precise perception of the complicated environment. Furthermore, the increasing use of micro- and nanorobotics in surgery will generate new guidance issues.
Surgical robotics
With the integration of AI, surgical robotics would be able to perceive and understand complicated surroundings, conduct real-time decision-making, and perform surgical tasks with increased precision, safety, automation, and efficiency. For instance, current robots can already automatically perform some simple surgical tasks, such as suturing and knot tying [
118,
119]. Nevertheless, the increased level of robotic autonomy for more complicated tasks can be achieved by advanced LfD and RL algorithms, especially when considering interactions with dynamic environments. Owing to the diversity of surgical robotic platforms, generalized learning for accurate modeling and control is also required.
Most current surgical robots are expensive and bulky and can only perform master–slave operations. We emphasize that more versatile, lighter, and probably cheaper robotic systems should be developed so that they can access more constrained regions during MIS [
2]. Certainly, such systems also need to be easily integrated into well-developed surgical workflows so that the robot can seamlessly collaborate with human operators. To date, current RAS technologies are still far from achieving full autonomy; human supervision will remain necessary to ensure safety and high-level decision-making.
In the near future, intelligent micro- and nanorobots for noninvasive surgeries and drug delivery could be realized. Furthermore, with the data captured during preoperative examinations, robots could also assist in the manufacturing of personalized 3D bioprinted tissues and organs for transplant surgery.
Ethical and legal considerations of AI in surgery
Beyond precision, robustness, safety, and automation, we must also carefully consider the legal and ethical issues related to AI in surgery. These issues include the following: (1) privacy: patients’ medical records, genetic data, illness-prediction data, and operation-process data must be protected with high security; (2) cybercrime: the negative effects on patients should be minimized when failures occur in AI-based surgical systems, which should be verified and certified with all possible risks considered; and (3) ethics: concerned parties should adhere to a code of ethics to ensure that new technologies, such as gene editing and bioprinted organ transplantation for long-term human reproduction, are used responsibly, and trust between humans and AI techniques should be built gradually.
In conclusion, we still have a long way to go to replicate and match in robotic surgery the level of intelligence that surgeons display. AI systems that can learn complex tasks on their own with a minimum of initial training data will prove critical for next-generation systems [
120]. Here we quote some of the questions raised by Yang
et al. in their article on Medical Robotics [
121]:
“As the capabilities of medical robotics following a progressive path represented by various levels of autonomy evolve, most of the role of the medical specialists will shift toward diagnosis and decision-making. Could this shift also mean that medical specialists will be less skilled in terms of dexterity and basic surgical skills as the technologies are introduced? What would be the implication on future training and accreditation? If robot performance proves to be superior to that of humans, should we put our trust in fully autonomous medical robots?” Clearly, various issues must be addressed before AI can be more seamlessly integrated in the future of surgery.