1 Introduction
Medical image segmentation aims to predict the semantic interpretation of each pixel, and is one of the crucial tasks in clinical image analysis. Accurate and reliable automatic segmentation is required to quickly provide clinicians with assistance and advice to improve the efficiency of clinical workflows.
In the past decades, due to the emergence of deep learning [1–9], medical image segmentation [10–16] has witnessed substantial progress. The pioneering U-Net [10] introduces skip connections that effectively fuse shallow texture information with deep semantic information, achieving very promising medical image segmentation results in most cases. Although many variants of U-Net [11–14] have since been proposed, U-Net remains the de facto network for medical images, achieving relatively good results compared to many of its alternatives [17].
Inspired by the success of the vision transformer (ViT) in image recognition [18] and semantic segmentation [19, 20], several transformer-based networks have been proposed for medical image segmentation. Specifically, Chen et al. [21] propose TransUNet, which replaces the encoder of U-Net with a hybrid of ResNet and ViT while keeping the original U-Net decoder. Based on the popular Swin-transformer [22], Swin-Unet [23], a UNet-like pure transformer architecture, has been proposed. Both TransUNet and Swin-Unet achieve very encouraging results in medical image segmentation.
Most existing methods frame medical image segmentation as an individual pixel-wise classification task and ignore the fact that medical objects usually have specific shapes. This may lead to anatomically implausible segmentation results, in particular for images from unseen datasets. Recently, some methods incorporating shape information [24–31] have been proposed to make segmentation results more anatomically correct. For instance, some methods [24–27] learn extra shape-related information and fuse shape and segmentation features. Other methods [28–31] apply an encoder-decoder network to both the predicted and ground-truth segmentation to align their latent non-linear representations, and refine the predicted segmentation through an auto-encoder.
Incorporating shape information has been shown to improve the performance of medical image segmentation. Most existing methods [24–38] force the network to learn and exploit shape features by predicting extra shape-related information or by aligning the non-linear representations of the segmentation result and the ground truth. However, these methods often require additional computation to fuse the learned shape features or a post-processing step to refine the segmentation. Furthermore, they fail to incorporate intensity information, which has been demonstrated to be valuable prior knowledge for medical image segmentation [39, 40].
In this paper, different from existing methods, we propose incorporating joint shape-intensity information to enhance the model’s segmentation performance while simultaneously improving its generalization to unseen datasets. This is achieved by novelly leveraging knowledge distillation [41]. Specifically, we first train a segmentation network on class-wise averaged training images without texture information. This segmentation network encodes useful shape-intensity knowledge and is regarded as the teacher network. We then train a student segmentation network with the same network architecture as the teacher network on the original training images. In addition to the classical segmentation loss, we also apply a distillation loss on the penultimate layer between the teacher and student networks. In this way, the student network effectively learns shape-intensity information, leading to more plausible intra-dataset and cross-dataset segmentation results (see Fig.1). The student network is considered the final segmentation network, which does not require any extra computation during inference, making it efficient in practical usage.
Fig.1 Comparison of averaged performance between the proposed SIKD and corresponding baseline models (including U-Net, SAUNet, PraNet, SANet, TransUnet, MaxStyle, SAMed, LM-Net and 2D D-LKA) under intra-dataset and cross-dataset evaluation on five medical image segmentation tasks of different modalities. The result is the average value across all baseline methods and the corresponding SIKD
The main contributions of this paper are: 1) We novelly propose to train a network (i.e., the teacher network) on class-wise averaged training images. This simple design explicitly extracts shape-intensity prior information; 2) We then leverage knowledge distillation to transfer the shape-intensity knowledge learned by the teacher model to the student network (i.e., the final segmentation network), effectively incorporating shape-intensity prior information for medical image segmentation; 3) Extensive experiments on five medical image segmentation tasks of different modalities demonstrate that the proposed Shape-Intensity Knowledge Distillation (SIKD) consistently and significantly improves the baseline models and generalizes better to images from unseen datasets.
The rest of this paper is organized as follows. In Section 2, we review some related works on medical image segmentation incorporating shape information and knowledge distillation. We then detail the proposed method in Section 3, followed by extensive experimental results in Section 4 and some discussions in Section 5. Finally, we conclude and give some perspectives in Section 6.
2 Related work
Since the pioneering work of U-Net [10], many U-Net-based works on medical image segmentation have been proposed [11–14, 17, 42, 43]. A detailed review of recent methods can be found in [44]. In this section, we mainly focus on briefly reviewing related works on deep-learning-based medical image segmentation incorporating shape information and on knowledge distillation. For more details about shape-aware medical image segmentation and knowledge distillation, we refer the reader to [45–47].
2.1 Shape-aware medical image segmentation
Many methods [24–38] have been proposed to incorporate shape prior information into the segmentation network and achieve more accurate segmentation results.
Some methods [24–27] rely on additional losses to learn extra shape-related targets, typically based on the object boundary. For instance, SAUNet [24] adds a shape stream (supervised by the ground-truth boundary) to the texture stream of a U-Net whose decoder is replaced by spatial and channel-wise attention paths. A gated convolution layer is then used to fuse shape and texture information for segmentation. In [25], the authors propose a loss based on segment-level shape similarity, which measures the curve similarity between each ground-truth boundary and the corresponding predicted boundary segment, requiring no extra runtime during inference. AtrialJSQnet [26] predicts an additional distance map with respect to the boundary to incorporate spatial and shape information, without introducing additional inference time. In addition to classical segmentation and edge losses, SMU-Net [27] also adopts a shape-aware loss characterizing the distance to the nearest object boundary and a position-aware loss reflecting the distance to the object center.
Some methods [28–33] apply an auto-reconstruction network with an encoder-decoder architecture to the predicted segmentation and the ground-truth annotation. The encoder projects the segmentation result and the annotation into a latent non-linear representation, in which the prediction and the ground-truth segmentation are aligned. The decoder is often used to refine the predicted segmentation [29–33], which heavily relies on a ground-truth mask degradation strategy to train the auto-reconstruction network. Specifically, [29] and ACNN-Seg [28] share a similar idea of leveraging such an auto-reconstruction pipeline. The latter does not rely on the decoder to refine segmentation results and thus needs no additional runtime during inference. Based on a similar auto-reconstruction network, [32] proposes an anatomically-constrained rejection sampling procedure to augment the latent representation and warps anatomically invalid segmentations toward the closest anatomically plausible ones. Post-DAE [30] leverages denoising autoencoders to post-process the segmentation result into an anatomically plausible segmentation. Chen et al. [31] further propose hard example generation in the latent space of the segmentation network to generate diverse training images and corrupted segmentation results, reinforcing the performance of refined segmentation. LFB-Net [33] shares the same decoder between the segmentation network and the auto-reconstruction network, and fuses the latent features of both networks.
Adversarial learning is also used to integrate shape information [48, 49], where the segmentation network is regarded as the generator. The core idea is to generate segmentation results that confuse the discriminator in distinguishing ground-truth from predicted segmentations. Jointly training the discriminator and the segmentation network helps to yield more plausible segmentation results, without introducing any extra runtime during inference.
Some other methods [34–37] focus on fusing/learning prior shape characteristics or on segmenting objects of specific shapes (e.g., star and circular shapes). For instance, the method in [34] fuses prior shape information about the distribution of semantic classes over the image domain (statistically computed on the training set) into a segmentation network. Tilborghs et al. [35] directly learn the shape parameters of an underlying shape model statistically computed on the set of training images, avoiding anatomically implausible segmentation. Mirikharaji and Hamarneh [36] introduce a star-shape regularized loss term to segment star-shaped skin lesions, which does not require additional inference time. Guo et al. [37] develop a globally optimal label fusion (GOLF) algorithm that frames the predicted segmentation and “nesting”/circular shape priors into a normalized cut framework, optimized by a proposed max-flow algorithm.
The existing methods that incorporate shape information achieve improved segmentation results. Most of them learn and fuse shape features guided by shape-related supervision, or rely on time-consuming auto-reconstruction to refine segmentation results. Others leverage statistical shape models to take shape prior information into account. These methods often ignore intensity prior information, which is also valuable for medical image segmentation [39, 40]. Differently, the proposed SIKD incorporates joint shape-intensity information into deep-learning-based medical image segmentation. Besides, we novelly resort to knowledge distillation to transfer the shape-intensity knowledge to the segmentation network. This further boosts segmentation performance and generalization to unseen images.
2.2 Knowledge distillation
Knowledge distillation (KD) [41] generally refers to transferring knowledge from a pre-trained teacher model to a student model to improve the performance of the student network. Since the pioneering work [41], many methods [50–55] have been proposed for efficient image classification. Hinton et al. [41] first propose that an efficient compact model can be obtained by transferring the knowledge of a cumbersome model to the compact model. Ba and Caruana [50] also demonstrate through a series of experiments that lightweight networks can learn complex functions previously learned by deep networks. Tian et al. [51] suggest that previous KD methods ignore important structural knowledge of the teacher network, and propose leveraging contrastive learning to transfer structural knowledge. Wang et al. [52] discover that it is better to distill the knowledge in the penultimate layer of the teacher network than to mimic the teacher’s soft logits. Besides, the authors of [52] also propose adopting locality-sensitive hashing (LSH) to make the student focus more on mimicking the feature direction than the feature magnitude. REFILLED [53] extends general knowledge distillation by applying both intra-image and cross-image relational KD.
With increasing efficiency requirements and hardware constraints, the idea of model compression via KD has gradually been applied to other vision tasks, such as semantic segmentation [56–61]. For instance, in addition to a classical pixel-wise distillation loss, Liu et al. [56] propose pair-wise and holistic distillation losses (the latter discriminating the segmentations of the teacher and student networks as real and fake, respectively) to make the student produce better-structured segmentations. Wang et al. [57] propose computing class-wise feature prototypes (class-wise averaged features) and then leveraging knowledge distillation to transfer the intra-class feature variation of the cumbersome teacher model to the compact student model. Qin et al. [58] propose distilling the regional affinity between class-wise averaged feature prototypes for efficient medical image segmentation. Shu et al. [59] improve KD-based segmentation by normalizing the activation map of each feature channel to a soft probability map before minimizing the Kullback–Leibler (KL) divergence between the teacher and student models. This makes the distillation process pay more attention to the most salient features, which is valuable for segmentation. Adaptive perspective distillation [60] distills the inter- and intra-distribution of cosine similarities between the adapted feature and the adapted feature prototype. CIRKD [61] distills cross-image relational knowledge to transfer global pixel relationships between images for segmentation.
KD is also widely used for cross-modal image analysis [62–65]. For example, Gupta et al. [62] propose feature mimicking between teacher and student models (with the same network architecture) to transfer learned representations from a large labeled modality to a new unlabeled modality, enabling rich representations to be learned from unlabeled modalities. In [63], the authors train the teacher model on concatenated multi-modal images and leverage KD to transfer multi-modal knowledge to a mono-modal segmentation network. Dou et al. [64] minimize the KL-divergence of the confusion matrix of class-wise averaged soft logits between different modalities to achieve unpaired multi-modal segmentation via KD. Li et al. [65] propose online mutual knowledge distillation for cross-modality medical image segmentation, where the segmentor on each modality explores the knowledge of the other modality via mutual KD.
Most KD methods transfer knowledge between different networks or modalities. There are also self-distillation methods [66, 67] that transfer knowledge from deep layers to shallow layers of the same network applied to the same modality. Specifically, SAD [66] performs top-down, layer-wise activation-based attention distillation within the network itself, achieving effective lightweight lane detection. Zhang et al. [67] attach several shallow classifiers at different shallow layers of the same network and make the shallow classifiers mimic the deep classifier.
Existing KD methods are mainly developed for efficient image classification, semantic segmentation, and cross-modal image analysis. Differently, we novelly leverage KD to incorporate shape-intensity knowledge for medical image segmentation. Both the teacher and student models have the same network architecture, trained on class-wise averaged images without texture information and on the original training images, respectively. The most related works [64, 65] adopt KD to transfer common shape knowledge between different medical image modalities. The proposed SIKD focuses on extracting joint shape-intensity knowledge from a single modality and incorporating such prior information to achieve accurate and robust medical image segmentation of the underlying modality.
3 Proposed method
3.1 Motivation
Medical image segmentation has recently witnessed great progress thanks to the development of deep learning. Most methods frame the problem as individual pixel-wise classification and adopt U-Net [10] or its various alternatives to learn feature representations for pixel-wise classification. Such a classical learning-based segmentation scheme usually ignores the fact that the object of interest in medical images generally has a specific shape, and that intensity prior information is also useful for medical image segmentation. Indeed, it is usually easier to segment objects of relatively smooth appearance than textured objects from an image (see Fig.2 for an example). The smoothed object mainly contains specific shape and homogeneous intensity information rather than texture information.
Fig.2 The pipeline of the proposed method. The teacher and student model have the same network architecture, trained on class-wise averaged training images and original images with segmentation loss, respectively. For the student model, we also apply the distillation loss on the penultimate layer between the teacher and student model to transfer the shape-intensity knowledge
To incorporate the shape-intensity information into medical image segmentation, we propose to first extract the shape-intensity knowledge from class-wise averaged training images, and then transfer such prior knowledge using knowledge distillation [41]. Specifically, for each training image $I$, we first compute the class-wise averaged image $\bar{I}$, which contains only the shape-intensity information. We then feed $\bar{I}$ to the teacher segmentation network to extract the shape-intensity information. Knowledge distillation is then adopted to transfer the prior knowledge to the student segmentation network, which has the same network architecture as the teacher network but takes the original training images as inputs. In this way, the student model effectively acquires shape-intensity knowledge and is regarded as the final segmentation model. The proposed Shape-Intensity Knowledge Distillation (SIKD) pipeline is depicted in Fig.2.
3.2 Shape-intensity knowledge encoding
The object of interest in medical images usually has inhomogeneous intensity and contains some texture. On the other hand, objects in medical images often have a specific shape. Therefore, medical image segmentation can benefit from shape and intensity information. To encode such shape-intensity information, we propose to apply a segmentation network on the class-wise averaged training images. More precisely, for a given image $I$, we calculate the class-wise averaged image $\bar{I}$ by taking the mean pixel value within each class. Formally, let $\Omega_c$ denote the set of pixels belonging to the $c$-th class. For each pixel $p \in \Omega_c$, the class-wise averaged image is given by:

$\bar{I}(p) = \frac{1}{|\Omega_c|} \sum_{q \in \Omega_c} I(q)$,    (1)

where $|\cdot|$ denotes the cardinality. It is noteworthy that when the input image is in RGB format, we calculate the class-wise averaged value separately for each channel. Such a class-wise averaged image has two distinct characteristics: 1) It does not contain any texture information (shown in Fig.2); 2) Its pixel value distribution is consistent with that of the original image (shown in Fig.3). Thanks to these two characteristics, training a segmentation network on these class-wise averaged images enables the network to learn shape-intensity information. We regard such a trained segmentation network as the teacher network.
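The class-wise averaging in Eq. (1) is straightforward to implement. Below is a minimal NumPy sketch (the function name and signature are illustrative and not taken from the original code release); it assumes a grayscale or RGB image and an integer label map, and averages each channel separately as described above. Note that the averaging uses the ground-truth label map, so class-wise averaged images are only available for training images, consistent with the teacher network being used only during training.

```python
import numpy as np

def class_wise_average(image: np.ndarray, label: np.ndarray) -> np.ndarray:
    """Replace each pixel by the mean intensity of its class (Eq. (1)).

    image: H x W (grayscale) or H x W x C (RGB) array.
    label: H x W integer array of class indices.
    """
    averaged = np.zeros_like(image, dtype=np.float32)
    for c in np.unique(label):
        mask = label == c                          # pixels of the c-th class
        if image.ndim == 2:                        # grayscale: one mean per class
            averaged[mask] = image[mask].mean()
        else:                                      # RGB: one mean per class and channel
            for ch in range(image.shape[2]):
                averaged[..., ch][mask] = image[..., ch][mask].mean()
    return averaged
```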
Fig.3 Distribution of pixel intensity within each class (e.g., RV cavity, Myocardium, and LV cavity) of the original images (a) and class-wise averaged images (b) on the ACDC training dataset [68]
We train the teacher network on the class-wise averaged training images with a classical segmentation loss $\mathcal{L}_{seg}$. Specifically, for the U-Net-based SIKD, we adopt the cross-entropy loss. For the other SIKD variants, we adopt the same cross-entropy and Dice losses as the corresponding baseline models for the segmentation loss.
3.3 Shape-intensity knowledge distillation
As introduced in Section 2.2, KD is usually used to transfer knowledge from a cumbersome convolutional neural network (teacher) to a compact one (student) by aligning the representations of some layers of the two networks. In this way, the compact student network is equipped with the powerful feature representation of the teacher network, which helps to improve the performance of the student network. Unlike these classical knowledge distillation frameworks, we propose to leverage knowledge distillation to transfer the shape-intensity information extracted by the teacher network described in Section 3.2. Specifically, the student network, regarded as the final segmentation network, has the same network architecture as the teacher network. We train the student network on the original training images. In addition to the classical segmentation loss, we also adopt a distillation loss to align the features of the penultimate layer of the teacher and student networks. Let $F_t$ and $F_s$ denote the features of the penultimate layer (before the segmentation layer) of the teacher and student network, respectively. The adopted KD loss $\mathcal{L}_{kd}$ is given by:

$\mathcal{L}_{kd} = \mathrm{MSE}(F_t, F_s)$,    (2)

where $\mathrm{MSE}(\cdot)$ is the Mean-Squared Error loss function. This distillation loss facilitates the shape-intensity knowledge transfer from the teacher segmentation network to the student segmentation network.
For the student network trained on the original training images, the overall training objective $\mathcal{L}$ consists of the same segmentation loss $\mathcal{L}_{seg}$ as the teacher segmentation network and the KD loss defined in Eq. (2). Formally, $\mathcal{L}$ is given by:

$\mathcal{L} = \mathcal{L}_{seg} + \alpha \mathcal{L}_{kd}$,    (3)

where $\alpha$ is a hyperparameter (set to 2 in all experiments) that controls the contribution of the segmentation loss and the distillation loss terms. It is noteworthy that the teacher segmentation network and the KD process are only involved in the training phase. Therefore, during inference, the segmentation network does not require any extra runtime or memory cost.
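For concreteness, the following PyTorch sketch shows one student-update step combining Eqs. (2) and (3). It is a simplified illustration under stated assumptions: both networks are assumed to return a pair (logits, penultimate feature), the teacher is assumed to be pre-trained on class-wise averaged images and kept frozen, and the segmentation loss is plain cross-entropy as in the U-Net variant (other baselines would add their own Dice term). The function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

ALPHA = 2.0  # loss weight alpha in Eq. (3)

def sikd_student_step(student, teacher, image, avg_image, target, optimizer):
    """One SIKD training step for the student network (illustrative sketch)."""
    teacher.eval()
    with torch.no_grad():                   # frozen teacher only provides targets
        _, feat_t = teacher(avg_image)      # shape-intensity feature from averaged image

    logits_s, feat_s = student(image)       # student sees the original image
    loss_seg = F.cross_entropy(logits_s, target)   # classical segmentation loss
    loss_kd = F.mse_loss(feat_s, feat_t)           # Eq. (2): penultimate-layer alignment
    loss = loss_seg + ALPHA * loss_kd              # Eq. (3)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```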
4 Experiments
We conduct intra- and cross-dataset experiments on five segmentation tasks covering different medical imaging modalities to demonstrate the effectiveness of the proposed SIKD. The intra-dataset test set serves as a kind of validation set for model selection; we mainly focus on the cross-dataset segmentation performance of the selected model.
4.1 Datasets and evaluation protocols
Cardiac segmentation in MRI images The Automated Cardiac Diagnosis Challenge (ACDC) [68] releases 100 annotated MRI volumes obtained from 100 different patients. We randomly divide the dataset into 7:1:2 for training, validation, and testing, respectively. We also evaluate the corresponding models on the training set of the M&Ms dataset [69] to assess the generalization ability of different methods. M&Ms releases 150 annotated training images from two different MRI vendors.
Multi-organ segmentation in CT images The Synapse multi-organ segmentation dataset [70] contains 30 abdominal CT scans, of which 18 (resp. 12) cases are used for training (resp. testing). The goal is to segment 8 abdominal organs. We also conduct evaluations on the AbdomenCT-1K dataset [71], containing more than 1000 CT scans from 12 medical centers, to benchmark the generalization ability of different methods.
Polyp segmentation in colonoscopy images Following the experimental setup in [72], we conduct experiments on five public datasets for colorectal polyp segmentation: Kvasir [73], CVC-ClinicDB [74], CVC-ColonDB [75], ETIS [76], and Endoscene [77]. We randomly split the Kvasir and CVC-ClinicDB datasets into 4:1 for training and testing, respectively. CVC-ColonDB, ETIS, and Endoscene are used to assess the generalization ability of different methods. Note that the Endoscene dataset is a combination of CVC-ClinicDB and CVC-300. We only evaluate the model on the test set of CVC-300, denoted CVC-T.
Optic nerve head segmentation in color fundus images The REFUGE challenge dataset [78] contains 400 color fundus images (CFI), randomly divided into 4:1 for training and testing, respectively. To further demonstrate the generalization ability of the proposed SIKD, we also evaluate different models (trained on the REFUGE dataset) on the public Drishti-GS [79] dataset consisting of 101 images and the RIM-ONE-r3 [80] dataset containing 159 images.
Breast tumor segmentation in ultrasound images The BUSI dataset [81] consists of a total of 780 images, including 487 benign images, 210 malignant images, and 133 normal images. This dataset is randomly split into 4:1 for training and testing. We also conduct cross-dataset evaluation on the dataset used in [82] and dataset B [83] to benchmark the generalization ability of different methods. The dataset in [82] is composed of 42 breast ultrasound (BUS) images. Dataset B [83] consists of 163 images, corresponding to 110 benign and 53 malignant lesion images.
Evaluation protocols We adopt three metrics widely used in medical image segmentation to evaluate the proposed SIKD: Dice score (Dice), Intersection over Union (IoU), and Hausdorff Distance (HD).
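As a reference, the following sketch computes the three metrics for a pair of binary masks. It is a simplification under assumptions of our own: the Hausdorff distance here is computed over all foreground pixel coordinates in pixel units, whereas reported HD values are typically computed on boundary points and scaled by the voxel spacing.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Dice score and IoU between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance between foreground pixel coordinates."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```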
4.2 Implementation details
The major goal of this paper is not to obtain state-of-the-art results on each segmentation task, but to show that the proposed SIKD is an effective anatomy-aware segmentation method that incorporates shape-intensity information and contributes to robust medical image segmentation. We therefore simply adopt several open-source segmentation networks for each segmentation task, including the most widely used U-Net [10] for medical image segmentation, PraNet [72], SANet [84], TransUNet [21], MaxStyle [85], LM-Net [86], 2D D-LKA Net [87], and the recent SAMed [88], which is based on the vision foundation model SAM [89]. We also implement the proposed SIKD on the baseline (BL) of SAUNet [24], one of the very few shape-aware medical image segmentation methods that release their source code. Note that the teacher and student networks have exactly the same network architecture, except for U-Net and TransUNet, where the skip-connections are discarded in the teacher segmentation network. This is because segmenting class-wise averaged training images is a relatively simple task, and using the skip-connections may prevent the penultimate layer of the teacher model from learning shape-intensity information. All experiments are conducted using the PyTorch framework on a workstation with an NVIDIA GeForce RTX 3090 GPU (24 GB).
4.3 Experimental results
Cardiac segmentation We first simply build the proposed SIKD using the de facto U-Net for medical image segmentation. Some qualitative results are illustrated in Fig.4. SIKD achieves anatomically correct segmentation results, and alleviates the problem of intra-class inconsistency for the baseline U-Net under both intra-dataset and cross-dataset settings. The quantitative comparison is given in Tab.1. SIKD outperforms the corresponding baseline model.
Fig.4 Some results of the proposed SIKD built upon the baseline U-Net on the cardiac segmentation (top two rows: intra- and cross-dataset) and some qualitative illustration of SIKD built on the baseline TransUNet for multi-organ segmentation (bottom two rows: intra- and cross-dataset)
Tab.1 Quantitative comparison between the baseline models and the proposed SIKD on the ACDC dataset [68] and M&Ms dataset [69] for cardiac segmentation in MRI images. ↑ (resp. ↓) indicates the higher (resp. lower) the better
Method | Intra-dataset: ACDC | | Cross-dataset: M&Ms |
Average | | LV | | Myo | | RV | | Average | | LV | | Myo | | RV |
DICE | HD | | DICE | HD | | DICE | HD | | DICE | HD | | DICE | HD | | DICE | HD | | DICE | HD | | DICE | HD |
U-Net [10] | 0.884 | 1.30 | | 0.905 | 1.47 | | 0.852 | 1.24 | | 0.895 | 1.20 | | 0.686 | 4.08 | | 0.733 | 3.26 | | 0.641 | 4.88 | | 0.684 | 4.09 |
U-Net w/ SDM [90] | 0.867 | 1.54 | | 0.914 | 1.25 | | 0.837 | 2.09 | | 0.850 | 1.29 | | 0.703 | 4.54 | | 0.753 | 4.65 | | 0.658 | 6.74 | | 0.697 | 3.22 |
U-Net + SIKD | 0.894 | 1.21 | 0.925 | 1.29 | 0.861 | 1.10 | 0.895 | 1.23 | 0.785 | 2.95 | 0.831 | 2.82 | 0.741 | 3.57 | 0.783 | 2.48 |
Baseline (BL) in [24] | 0.870 | 2.49 | | 0.913 | 2.33 | | 0.837 | 3.62 | | 0.860 | 1.52 | | 0.729 | 7.77 | | 0.796 | 9.73 | | 0.662 | 7.53 | | 0.728 | 6.05 |
SAUNet [24] | 0.880 | 2.05 | 0.910 | 1.88 | 0.845 | 2.31 | 0.883 | 1.96 | 0.739 | 4.57 | 0.806 | 5.93 | 0.672 | 4.21 | 0.738 | 3.56 |
BL + SIKD | 0.892 | 1.58 | 0.924 | 1.36 | 0.856 | 1.93 | 0.897 | 1.45 | 0.766 | 3.73 | 0.816 | 4.39 | 0.733 | 3.67 | 0.750 | 3.13 |
MaxStyle [85] | 0.876 | 2.23 | | 0.914 | 2.56 | | 0.843 | 2.28 | | 0.872 | 1.85 | | 0.818 | 3.84 | | 0.826 | 3.17 | | 0.780 | 5.62 | | 0.847 | 2.74 |
MaxStyle + SIKD | 0.888 | 1.95 | 0.923 | 2.05 | 0.857 | 1.87 | 0.883 | 1.92 | 0.827 | 2.85 | 0.833 | 2.81 | 0.792 | 3.13 | 0.857 | 2.62 |
SAMed [88] | 0.890 | 1.15 | | 0.924 | 1.11 | | 0.853 | 1.12 | | 0.894 | 1.22 | | 0.826 | 2.38 | | 0.862 | 2.27 | | 0.783 | 2.39 | | 0.833 | 2.46 |
SAMed + SIKD | 0.893 | 1.10 | 0.929 | 1.04 | 0.855 | 1.11 | 0.895 | 1.16 | 0.842 | 2.33 | 0.873 | 2.29 | 0.799 | 1.89 | 0.855 | 2.81 |
LM-Net [86] | 0.881 | 1.53 | | 0.922 | 1.21 | | 0.852 | 1.44 | | 0.867 | 1.93 | | 0.742 | 3.17 | | 0.828 | 3.11 | | 0.713 | 3.97 | | 0.684 | 2.43 |
LM-Net + SIKD | 0.900 | 1.33 | 0.933 | 1.28 | 0.873 | 1.41 | 0.894 | 1.11 | 0.802 | 3.12 | 0.866 | 2.85 | 0.774 | 2.70 | 0.767 | 3.80 |
2D D-LKA Net [87] | 0.898 | 2.06 | | 0.929 | 1.69 | | 0.875 | 1.09 | | 0.889 | 3.39 | | 0.824 | 2.77 | | 0.881 | 2.03 | | 0.778 | 1.84 | | 0.813 | 4.43 |
2D D-LKA + SIKD | 0.902 | 1.91 | 0.933 | 1.51 | 0.875 | 1.23 | 0.896 | 2.98 | 0.854 | 2.21 | 0.905 | 1.75 | 0.822 | 1.36 | 0.835 | 3.51 |
We then benchmark the generalization ability of different methods for automatic cardiac segmentation in MRI images. As depicted in Tab.1, when generalizing the models trained on the ACDC dataset to the M&Ms dataset, SIKD built on U-Net boosts the performance of the baseline model by 9.9% Dice score and 1.13 mm HD. SIKD also shows improvements over SAUNet and U-Net w/ SDM [90] under this cross-dataset evaluation. Besides, building SIKD upon the recent MaxStyle [85], SAMed [88], LM-Net [86], and 2D D-LKA Net [87], which already have strong generalization capabilities, further consistently enhances their cross-dataset segmentation performance. The quantitative benchmark on cardiac segmentation confirms that SIKD effectively incorporates shape-intensity information, thus enhancing the generalization ability of medical image segmentation.
Multi-organ segmentation For multi-organ segmentation in CT images, some illustrative results are shown in Fig.4. Qualitatively, SIKD built on TransUNet [21] accurately segments different organs and preserves their shapes well. The quantitative benchmark on the Synapse multi-organ segmentation dataset is depicted in Tab.2. SIKD performs better than the corresponding baselines in terms of both Dice score and HD, implying that SIKD achieves good surface prediction and preserves shapes better. Specifically, SIKD outperforms the baseline TransUNet by 2.69% Dice score and 8.04 mm HD. Compared with SwinUnet [23], SIKD based on TransUNet achieves an improvement of 1.01% Dice score and 1.89 mm HD. Implementing SIKD with MaxStyle [85] and SAMed [88] also consistently boosts the intra-dataset segmentation performance. The latest 2D D-LKA Net achieves state-of-the-art performance; built on this approach, our method improves the Dice score by more than 1%.
Tab.2 Comparison on the Synapse multi-organ segmentation dataset [70] and the AbdomenCT-1K dataset [71]
Method | Intra-dataset | | Cross-dataset |
Synapse | | AbdomenCT-1K |
DICE | HD | | DICE | HD |
U-Net [10] | 0.760 | 43.64 | | 0.468 | 94.32 |
U-Net w/ SDM [90] | 0.392 | 100.35 | 0.257 | 152.11 |
U-Net + SIKD | 0.780 | 30.05 | 0.487 | 94.23 |
Baseline (BL) in [24] | 0.782 | 32.47 | | 0.556 | 92.61 |
SAUNet [24] | 0.790 | 31.38 | 0.598 | 86.34 |
BL + SIKD | 0.792 | 24.61 | 0.670 | 74.13 |
SwinUnet [23] | 0.791 | 21.55 | | − | − |
TransUNet [21] | 0.775 | 27.70 | 0.695 | 75.97 |
TransUNet + SIKD | 0.801 | 19.66 | 0.715 | 71.48 |
MaxStyle [85] | 0.757 | 26.33 | | 0.664 | 52.37 |
MaxStyle + SIKD | 0.771 | 21.76 | 0.691 | 43.15 |
SAMed [88] | 0.816 | 21.78 | | 0.803 | 44.66 |
SAMed + SIKD | 0.816 | 20.19 | 0.819 | 25.99 |
LM-Net [86] | 0.793 | 23.64 | | 0.727 | 50.85 |
LM-Net + SIKD | 0.808 | 20.76 | 0.755 | 38.59 |
2D D-LKA Net [87] | 0.833 | 18.96 | | 0.797 | 35.56 |
2D D-LKA + SIKD | 0.843 | 17.29 | 0.822 | 22.11 |
We also benchmark the generalization ability of different methods by conducting cross-dataset evaluation on the AbdomenCT-1K dataset [71] with the models trained on the Synapse multi-organ segmentation dataset [70]. As depicted in Tab.2, SIKD outperforms all corresponding baseline models under cross-dataset evaluation. In particular, SIKD built on the baseline of SAUNet [24] improves the baseline model by 11.44% Dice score and 18.48 mm HD. Compared with SAUNet, which only incorporates shape information, SIKD achieves 7.17% Dice score and 12.21 mm HD improvements. Building SIKD on SAMed [88] improves the baseline model by 1.6% Dice score and 18.67 mm HD. Built on the SOTA 2D D-LKA method, our approach achieves the best generalization results. This demonstrates that SIKD effectively incorporates shape-intensity knowledge and generalizes well to images from unseen datasets.
Polyp segmentation Some qualitative polyp segmentation results are illustrated in Fig.5. The proposed SIKD based on U-Net accurately segments the polyps. The quantitative comparison with the baseline models and some state-of-the-art methods is shown in Tab.3. SIKD outperforms all other methods on the Kvasir and CVC-ClinicDB datasets, on which the models are trained. In particular, compared with SAUNet, which only incorporates shape information, SIKD is more effective, with improvements ranging from 1.6% to 2.5%. SIKD built on SANet is comparable with the recent LDNet [92], which relies on a lesion-aware dynamic kernel and cross- and self-attention modules. SIKD built on MaxStyle [85], SAMed [88], LM-Net [86], and 2D D-LKA Net [87] further boosts the segmentation performance of the corresponding baseline methods. It is noteworthy that these improvements are achieved without any extra runtime or memory cost during inference compared with the corresponding baseline methods.
Fig.5 Some results on the intra-dataset (left two columns) and cross-dataset (right two columns) of polyp segmentation, ONH segmentation, and breast tumor segmentation (from top to bottom). Green outline: segmentation by the baseline U-Net model; Blue outline: segmentation by SIKD built upon U-Net; Light blue area: ground-truth segmentation
Tab.3 Quantitative evaluation of polyp segmentation on Kvasir [73], CVC-ClinicDB [74], CVC-ColonDB [75], ETIS [76], and CVC-T [77] datasets
Method | Intra-dataset | | Cross-dataset |
Kvasir | | CVC-ClinicDB | | CVC-ColonDB | | ETIS | | CVC-T |
IoU | DICE | | IoU | DICE | | IoU | DICE | | IoU | DICE | | IoU | DICE |
SFA [91] | 0.611 | 0.723 | | 0.607 | 0.700 | | 0.347 | 0.469 | | 0.217 | 0.297 | | 0.329 | 0.467 |
U-Net++ [12] | 0.743 | 0.821 | 0.729 | 0.794 | 0.410 | 0.483 | 0.344 | 0.401 | 0.624 | 0.707 |
PraNet [72] | 0.840 | 0.898 | 0.849 | 0.899 | 0.640 | 0.709 | 0.567 | 0.628 | 0.797 | 0.871 |
SANet [84] | 0.847 | 0.904 | 0.859 | 0.916 | 0.670 | 0.753 | 0.654 | 0.750 | 0.815 | 0.888 |
LDNet [92] | 0.853 | 0.907 | 0.895 | 0.943 | 0.706 | 0.784 | 0.665 | 0.744 | − | − |
U-Net [10] | 0.761 | 0.841 | | 0.838 | 0.885 | | 0.552 | 0.598 | | 0.323 | 0.383 | | 0.697 | 0.769 |
U-Net w/ SDM [90] | 0.766 | 0.849 | 0.843 | 0.895 | 0.567 | 0.641 | 0.349 | 0.405 | 0.630 | 0.706 |
U-Net + SIKD | 0.769 | 0.851 | 0.851 | 0.903 | 0.576 | 0.650 | 0.445 | 0.513 | 0.712 | 0.788 |
Baseline (BL) in [24] | 0.810 | 0.867 | | 0.829 | 0.886 | | 0.606 | 0.679 | | 0.448 | 0.510 | | 0.780 | 0.849 |
SAUNet [24] | 0.812 | 0.870 | 0.825 | 0.880 | 0.587 | 0.658 | 0.536 | 0.607 | 0.761 | 0.830 |
BL + SIKD | 0.836 | 0.889 | 0.850 | 0.896 | 0.619 | 0.687 | 0.532 | 0.600 | 0.791 | 0.866 |
PraNet [72] | 0.834 | 0.892 | | 0.859 | 0.909 | | 0.649 | 0.721 | | 0.579 | 0.653 | | 0.818 | 0.891 |
PraNet + SIKD | 0.852 | 0.904 | 0.875 | 0.927 | 0.657 | 0.733 | 0.607 | 0.679 | 0.830 | 0.897 |
SANet [84] | 0.845 | 0.903 | | 0.861 | 0.916 | | 0.679 | 0.762 | | 0.642 | 0.741 | | 0.785 | 0.870 |
SANet + SIKD | 0.856 | 0.909 | 0.880 | 0.929 | 0.712 | 0.790 | 0.678 | 0.760 | 0.823 | 0.892 |
MaxStyle [85] | 0.724 | 0.808 | | 0.724 | 0.790 | | 0.493 | 0.593 | | 0.305 | 0.364 | | 0.657 | 0.769 |
MaxStyle + SIKD | 0.755 | 0.835 | 0.747 | 0.813 | 0.510 | 0.603 | 0.361 | 0.427 | 0.671 | 0.780 |
SAMed [88] | 0.836 | 0.896 | | 0.795 | 0.865 | | 0.665 | 0.742 | | 0.565 | 0.644 | | 0.763 | 0.841 |
SAMed + SIKD | 0.838 | 0.897 | 0.802 | 0.872 | 0.688 | 0.761 | 0.626 | 0.709 | 0.786 | 0.859 |
LM-Net [86] | 0.827 | 0.893 | | 0.817 | 0.872 | | 0.653 | 0.734 | | 0.565 | 0.645 | | 0.752 | 0.835 |
LM-Net + SIKD | 0.835 | 0.895 | 0.844 | 0.896 | 0.687 | 0.765 | 0.598 | 0.689 | 0.790 | 0.864 |
2D D-LKA Net [87] | 0.833 | 0.890 | | 0.823 | 0.878 | | 0.664 | 0.742 | | 0.523 | 0.599 | | 0.771 | 0.832 |
2D D-LKA + SIKD | 0.846 | 0.902 | 0.831 | 0.885 | 0.702 | 0.779 | 0.606 | 0.678 | 0.805 | 0.876 |
We also evaluate the proposed SIKD on the other three unseen datasets to assess its generalizability for polyp segmentation. As depicted in Tab.3, using U-Net as the baseline model, SIKD achieves improvements of 2.4%, 12.2%, and 1.5% IoU (resp. 5.2%, 13.0%, and 1.9% Dice score) on the CVC-ColonDB, ETIS, and CVC-T datasets, respectively. SIKD based on the baseline of SAUNet, PraNet, SANet, MaxStyle, SAMed, LM-Net [86], and 2D D-LKA Net [87] also achieves consistent improvements on these three unseen datasets. This demonstrates that the proposed SIKD generalizes well to unseen datasets. In particular, SIKD with SANet performs better than the recent LDNet [92] in generalizing to images from unseen datasets.
We observe that the cross-domain performance improvement for polyp segmentation is not as significant as that for the other segmentation tasks (see Fig.1). Therefore, we further perform a statistical analysis on polyp segmentation by training the baseline U-Net and the proposed SIKD built upon U-Net 10 times each. The statistical results are depicted in Fig.6. The histogram in Fig.6 shows that SIKD achieves higher average performance and more stable results than the baseline model.
Fig.6 Comparison on distribution of dice scores for polyp segmentation when training 10 times the baseline U-Net and the proposed SIKD on training images of Kvasir [73] and CVC-ClinicDB [74]
Optic nerve head segmentation We then conduct experiments on segmenting the optic nerve head (ONH) in color fundus images. Some qualitative segmentation results are shown in Fig.5. Both the baseline model and the proposed SIKD accurately segment the ONH on test images of the REFUGE dataset, on whose training set the models are trained. Yet, the baseline model performs poorly on images with a slightly different appearance. This demonstrates that the proposed SIKD effectively incorporates shape-intensity information, helping to yield accurate segmentation across domains.
The quantitative comparison between the baseline U-Net model and the proposed SIKD is depicted in Tab.4. On the REFUGE test set, SIKD slightly outperforms the baseline model. On the unseen RIM-ONE-r3 and Drishti-GS datasets, the proposed SIKD significantly improves the baseline model. Precisely, SIKD improves the baseline U-Net model by 23.5% (resp. 19.4%) IoU and 27.8% (resp. 18.7%) Dice score on the RIM-ONE-r3 (resp. Drishti-GS) dataset. Similar behavior is observed for SIKD built on the baseline of SAUNet [24]. It is noteworthy that although SAMed [88], LM-Net [86], and 2D D-LKA Net [87] already exhibit powerful generalization ability, our SIKD can still boost their cross-dataset segmentation performance.
Tab.4 Quantitative results on optic nerve head segmentation in fundus images. The models are trained on the training set of REFUGE [78]
Method | Intra-dataset | | Cross-dataset |
REFUGE | | RIM-ONE-r3 | | Drishti-GS |
IoU | DICE | | IoU | DICE | | IoU | DICE |
U-Net [10] | 0.924 | 0.960 | | 0.266 | 0.317 | | 0.433 | 0.505 |
U-Net w/ SDM [90] | 0.925 | 0.960 | 0.346 | 0.419 | 0.497 | 0.624 |
U-Net + SIKD | 0.928 | 0.962 | 0.501 | 0.595 | 0.627 | 0.692 |
Baseline (BL) in [24] | 0.884 | 0.938 | | 0.696 | 0.812 | | 0.782 | 0.874 |
SAUNet [24] | 0.906 | 0.950 | 0.662 | 0.766 | 0.840 | 0.911 |
BL + SIKD | 0.920 | 0.958 | 0.708 | 0.821 | 0.861 | 0.924 |
MaxStyle [85] | 0.927 | 0.961 | | 0.736 | 0.846 | | 0.889 | 0.939 |
MaxStyle + SIKD | 0.935 | 0.966 | 0.748 | 0.852 | 0.897 | 0.945 |
SAMed [88] | 0.920 | 0.958 | | 0.772 | 0.870 | | 0.906 | 0.950 |
SAMed + SIKD | 0.922 | 0.960 | 0.782 | 0.879 | 0.916 | 0.958 |
LM-Net [86] | 0.909 | 0.952 | | 0.730 | 0.841 | | 0.933 | 0.965 |
LM-Net + SIKD | 0.926 | 0.961 | 0.777 | 0.873 | 0.939 | 0.967 |
2D D-LKA Net [87] | 0.897 | 0.946 | | 0.744 | 0.851 | | 0.930 | 0.963 |
2D D-LKA + SIKD | 0.918 | 0.955 | 0.768 | 0.869 | 0.938 | 0.966 |
Breast tumor segmentation We then conduct experiments on breast tumor segmentation in ultrasound images. Some qualitative illustrations are given in Fig.5. Compared with the baseline U-Net, SIKD accurately segments the breast tumor under both intra-dataset and cross-dataset settings. The quantitative comparison between the baseline U-Net and SIKD is depicted in Tab.5. SIKD achieves significant improvement over the baseline U-Net. Specifically, on the BUSI dataset, SIKD improves the baseline U-Net by 3.1% IoU and 3.3% Dice score. SIKD also outperforms U-Net w/ SDM [90] by 2.3% IoU and 2.8% Dice score. Implementing SIKD with the baseline of SAUNet [24] improves the baseline by 8.7% IoU and 7.7% Dice score. Besides, the proposed SIKD also outperforms SAUNet by 2.4% IoU and 2.1% Dice score.
Tab.5 Quantitative evaluation of different methods for breast tumor segmentation on BUSI [81], IDFAHSTU [82], and dataset B [83]
Method | Intra-dataset | | Cross-dataset |
BUSI | | IDFAHSTU | | Dataset B |
IoU | DICE | | IoU | DICE | | IoU | DICE |
U-Net [10] | 0.580 | 0.676 | | 0.597 | 0.694 | | 0.397 | 0.461 |
U-Net w/ SDM [90] | 0.588 | 0.681 | 0.626 | 0.732 | 0.411 | 0.483 |
U-Net + SIKD | 0.611 | 0.709 | 0.646 | 0.755 | 0.449 | 0.521 |
BL in [24] | 0.535 | 0.642 | | 0.611 | 0.738 | | 0.362 | 0.459 |
SAUNet [24] | 0.598 | 0.698 | 0.696 | 0.805 | 0.513 | 0.615 |
BL + SIKD | 0.622 | 0.719 | 0.717 | 0.824 | 0.540 | 0.643 |
MaxStyle [85] | 0.603 | 0.701 | | 0.755 | 0.851 | | 0.529 | 0.631 |
MaxStyle + SIKD | 0.612 | 0.707 | 0.767 | 0.861 | 0.561 | 0.661 |
SAMed [88] | 0.672 | 0.762 | | 0.780 | 0.871 | | 0.684 | 0.778 |
SAMed + SIKD | 0.686 | 0.772 | 0.793 | 0.879 | 0.697 | 0.790 |
LM-Net [86] | 0.663 | 0.755 | | 0.737 | 0.842 | | 0.571 | 0.663 |
LM-Net + SIKD | 0.684 | 0.773 | 0.764 | 0.862 | 0.633 | 0.719 |
2D D-LKA Net [87] | 0.674 | 0.762 | | 0.593 | 0.719 | | 0.671 | 0.760 |
2D D-LKA + SIKD | 0.691 | 0.775 | 0.655 | 0.780 | 0.694 | 0.781 |
We also evaluate the proposed SIKD on images from the unseen dataset used in [82] and dataset B [83] to assess its generalizability. As depicted in Tab.5, applying SIKD to the baseline of SAUNet [24], MaxStyle [85], SAMed [88], LM-Net [86], and 2D D-LKA Net [87] boosts the segmentation performance of the corresponding baselines. This also demonstrates that the proposed SIKD generalizes well to images from unseen datasets. It is noteworthy that the cross-dataset performance on the dataset of [82] is better than the intra-dataset performance for breast tumor segmentation. This is probably because the dataset in [82] only contains 42 ultrasound images, which are not as challenging as the ultrasound images in the BUSI dataset [81] and dataset B [83].
5 Discussion
Medical image segmentation plays a crucial role in clinical practice. With the advancement of deep learning, many approaches have achieved notable results. However, many deep-learning-based methods treat segmentation as a pixel-wise classification task, often overlooking the fact that targets in medical images typically possess specific shape-intensity prior information. Additionally, some studies [93, 94] have demonstrated that deep neural networks tend to prioritize learning texture information over shape information. This bias can lead to anatomically implausible segmentation results, especially in cross-dataset segmentation, where domain shifts relative to the training images are present. Therefore, incorporating shape-intensity prior information is beneficial for both intra-dataset and cross-dataset segmentation.
Recently, some methods improve segmentation performance by incorporating shape information, making the segmentation results more reasonable. However, these methods either add edge constraints or further optimize the segmentation results, requiring additional computational resources. Different from existing methods, in this paper, we propose to incorporate joint shape-intensity prior information into the segmentation network. Specifically, we train a teacher network on the class-wise averaged training images to extract the shape-intensity information. We then employ KD to transfer the extracted shape-intensity information from the teacher network to the student network (i.e., the segmentation network).
The proposed SIKD relies on transferring the shape-intensity knowledge extracted by the teacher model. Therefore, we analyze the shape-intensity knowledge by visualizing the penultimate layer feature. As shown in Fig.7, compared with the baseline model, the feature extracted by the teacher model is smoother and has a more complete shape structure. The feature of the student model is very similar to the teacher model. This implies that the proposed SIKD effectively incorporates the shape-intensity knowledge learned by the teacher model into segmentation.
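The paper does not specify how the penultimate-layer features in Fig.7 are rendered; one simple way to reproduce such a visualization is to average the feature tensor over channels and display it as a heatmap, as in the sketch below. It again assumes, hypothetically, that the model returns a (logits, penultimate feature) pair.

```python
import torch
import matplotlib.pyplot as plt

def show_penultimate_feature(model, image, out_path="feature.png"):
    """Channel-averaged heatmap of the penultimate-layer feature (illustrative)."""
    model.eval()
    with torch.no_grad():
        _, feat = model(image.unsqueeze(0))          # image: C x H x W tensor; feat: 1 x C' x H x W
    heatmap = feat[0].mean(dim=0).cpu().numpy()      # average over channels -> H x W
    plt.imshow(heatmap, cmap="viridis")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
```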
Fig.7 Visualization of the penultimate layer feature of baseline U-Net and the proposed SIKD built upon U-Net
As previously mentioned, our method does not require extra runtime or memory cost during inference. We also analyze the training time requirements. In comparison to the baseline model, the proposed SIKD involves training an additional teacher model on class-wise averaged images to extract shape-intensity prior information, which is subsequently transferred to the student model via KD. During the training of the student network, a forward pass of the teacher network is needed to compute the KD loss. As shown in Tab.6, during training, our method’s FLOPs are nearly twice those of U-Net (as mentioned in Section 4.2, the skip-connections of the U-Net used as the teacher network are removed, thereby reducing the complexity of the teacher model), and GPU memory usage is also slightly higher than that of U-Net. However, during testing, our method and U-Net have the same FLOPs and GPU memory usage, incurring no additional overhead.
Tab.6 Comparison of computational requirements based on the U-Net [10] architecture. Calculations for training are performed with a batch size of 16 and image dimensions of 256 × 256 × 1, while testing is conducted with a batch size of 1 and the same image size
Method | Training | | Testing |
FLOPs/G | GPU memory/GB | | FLOPs/G | GPU memory/GB
U-Net [10] | 65.46 | 9.61 | | 65.46 | 0.25 |
U-Net + SIKD | 121.21 | 10.47 | | 65.46 | 0.25 |
We then conduct two types of ablation studies on the task of cardiac segmentation and optic nerve head segmentation.
Ablation study on the effect of transferring the shape-intensity knowledge A straightforward alternative is to train the teacher model using the original training images. Therefore, we train both the teacher and student U-Net model using the same setting. The only difference with the baseline model is that we also adopt the distillation loss when training the student model. As depicted in Tab.7, such a trivial alternative may occasionally bring some intra-dataset improvement, but not as significant as the proposed SIKD. This demonstrates that the performance improvement of SIKD is mainly brought by the proposed shape-intensity knowledge transferring via knowledge distillation, not the trivial knowledge distillation. We also conduct experiments by training the teacher network on the annotated label maps. As depicted in Tab.7, the variant of SIKD by transferring only the shape knowledge directly extracted from the label map is also beneficial for improving the segmentation performance and generalization ability. Yet, since the intensity of the label map is quite different from the original image in distribution, this variant of SIKD is not as effective as SIKD that distills the shape-intensity knowledge from the teacher model trained on class-wise averaged training images.
Tab.7 Evaluation of SIKD on cardiac and optic nerve head segmentation with different inputs for the teacher network
Input | Cardiac segmentation in MRI images | | Optic nerve head segmentation in color fundus images |
ACDC | | M&Ms | | REFUGE | | RIM-ONE-r3 | | Drishti-GS |
Dice | HD | | Dice | HD | | IoU | DICE | | IoU | DICE | | IoU | DICE |
Baseline U-Net [10] | 0.884 | 1.30 | | 0.686 | 4.08 | | 0.924 | 0.960 | | 0.266 | 0.317 | | 0.433 | 0.505 |
Original image | 0.882 | 1.26 | | 0.693 | 4.09 | | 0.927 | 0.961 | | 0.371 | 0.446 | | 0.571 | 0.631 |
Annotated label map | 0.893 | 1.16 | | 0.723 | 3.69 | | 0.922 | 0.961 | | 0.383 | 0.452 | | 0.552 | 0.607 |
Class-wise mean average | 0.894 | 1.21 | | 0.785 | 2.95 | | 0.928 | 0.962 | | 0.501 | 0.595 | | 0.627 | 0.692 |
Ablation study on the loss weight The only hyper-parameter of the proposed SIKD is the loss weight $\alpha$ in Eq. (3). We conduct an ablation study on this hyper-parameter by setting $\alpha$ to 0.5, 1.0, 2.0, 3.0, 4.0, and 5.0, respectively. The corresponding results are depicted in Fig.8. Using different $\alpha$ slightly affects the intra-dataset performance, while the generalization ability changes more significantly. SIKD with different settings of $\alpha$ generally performs much better than the baseline model (equivalent to setting $\alpha = 0$), further proving the effectiveness of SIKD. Setting $\alpha = 2$ gives the best Dice score on cardiac segmentation and performs relatively well on optic nerve head segmentation. Therefore, we set $\alpha$ to 2 for the proposed SIKD in all experiments.
Fig.8 Evaluation of SIKD with different settings of $\alpha$ in Eq. (3). Setting $\alpha = 0$ is equivalent to the baseline model. (a) Effect of using different $\alpha$ for cardiac segmentation; (b) effect of using different $\alpha$ for optic nerve head segmentation
6 Conclusion
In this paper, we propose a novel joint shape-intensity KD method for deep-learning-based medical image segmentation. We leverage KD to transfer the shape-intensity information extracted by the teacher network trained on class-wise averaged images. In this way, the student network, which has the same network architecture as the teacher model, effectively learns shape-intensity information for medical image segmentation. Extensive experiments on five medical image segmentation tasks of different modalities demonstrate that the proposed SIKD achieves consistent and significant improvements over the baseline methods and some state-of-the-art methods. The proposed SIKD can be applied to most popular segmentation networks and brings performance improvements without any additional computation during inference. In the future, we would like to explore other feature layers for distillation and other distillation loss functions, and apply the proposed SIKD to 3D medical image segmentation tasks.