RESEARCH ARTICLE

Toward few-shot domain adaptation with perturbation-invariant representation and transferable prototypes

  • Junsong FAN 1,2,
  • Yuxi WANG 1,2,3,
  • He GUAN 1,2,
  • Chunfeng SONG 1,2,
  • Zhaoxiang ZHANG 1,2,3
  • 1. Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • 2. University of Chinese Academy of Sciences, Beijing 100049, China
  • 3. Centre for Artificial Intelligence and Robotics, HKISI_CAS, Hong Kong 999077, China

Received date: 08 Jan 2022

Accepted date: 09 Mar 2022

Published date: 15 Jun 2022

Copyright

2022 Higher Education Press

Abstract

Domain adaptation (DA) for semantic segmentation aims to reduce the annotation burden of the dense pixel-level prediction task. It focuses on tackling the domain gap problem and manages to transfer knowledge learned from abundant source data to new target scenes. Although recent works have achieved rapid progress in this field, they still underperform fully supervised models by a large margin due to the absence of any available hints in the target domain. Considering that few-shot labels are cheap to obtain in practical applications, we attempt to leverage them to mitigate the performance gap between DA and fully supervised methods. The key to this problem is to effectively leverage the few-shot labels to learn robust domain-invariant predictions. To this end, we first design a data perturbation strategy to enhance the robustness of the representations. Furthermore, a transferable prototype module is proposed to bridge the domain gap based on the source data and the few-shot targets. By means of these proposed methods, our approach can, to some extent, perform on par with fully supervised models. We conduct extensive experiments to demonstrate the effectiveness of the proposed methods and report state-of-the-art performance on two popular DA tasks, i.e., from GTA5 to Cityscapes and from SYNTHIA to Cityscapes.

Cite this article

Junsong FAN , Yuxi WANG , He GUAN , Chunfeng SONG , Zhaoxiang ZHANG . Toward few-shot domain adaptation with perturbation-invariant representation and transferable prototypes[J]. Frontiers of Computer Science, 2022 , 16(3) : 163347 . DOI: 10.1007/s11704-022-2015-7

1 Introduction

With the help of large-scale datasets, deep learning-based approaches have shown unprecedented advances in many applications. However, the acquisition of well-labeled large-scale datasets is time-consuming and expensive, limiting the application of these advanced deep models. This problem is even more severe in dense pixel-wise tasks such as semantic segmentation, where full annotations require precise per-pixel class discrimination [1]. Meanwhile, models trained on existing source scenes generally perform poorly when generalized to new target scenes due to the well-known domain gap problem. Thus, it is necessary to finetune the model on the target domain with some data. To achieve acceptable performance on the target scenes while minimizing the data annotation burden, a natural idea is adapting the model from the labeled source domain to the unlabeled target domain, i.e., unsupervised domain adaptation (UDA) for semantic segmentation [2-5].
The UDA approach is developed to learn domain-invariant representations and improve the model’s performance on the target domain without knowing target domain labels. Most UDA methods tackle this task by matching different levels of distributions between the source and the target domains to suppress the domain shift, such as pixel-level [6] and feature-level [2, 7, 8] distributions. However, due to the absence of supervision on the target domain, there is still a significant performance gap between state-of-the-art UDA and fully supervised scenarios, especially for pixel-level semantic segmentation tasks. Semi-supervised domain adaptation (SSDA) [9-11] alleviates this problem by using a proportion of labeled data in the target domain. However, the required annotation amount is generally large, significantly increasing the annotation burden.
Alternatively, in this work, we aim to imitate the human ability to learn from very little supervision so as to improve domain adaptation performance while minimizing the annotation burden. To this end, we propose to conduct few-shot domain adaptation (FSDA) for semantic segmentation, where only a fixed, small number of labels are available in the target domain. Compared to earlier SSDA, the proposed FSDA uses far fewer labeled samples for training, which are much easier to obtain. However, the FSDA model faces severe challenges: overfitting to the very few target samples, class imbalance, and classes that are absent from the target labels. Compared to UDA, with the help of the few labels, the representations of each category can be reliably aligned across the source and the target domains to achieve a consistent representation space. As a result, the models’ performance can be significantly improved.
We focus our design on effectively leveraging the few-shot labels on the target domain to help mitigate the domain gap. To this end, in addition to the common semantic segmentation loss [1] and adversarial loss [12] for DA, we propose two main components for the FSDA problem. Firstly, we propose a transferable prototype-based module to align the source and the target domains with the help of few-shot labels on the target set. Unlike instance-level or pixel-level representations, which are vulnerable, the prototypes are usually more robust and stable because they aggregate context across different samples. Thanks to the few-shot labels, the representations of each category can be reliably expressed with the prototypes and then aligned across domains. We design two kinds of prototype transferring objectives to align the domains, i.e., the category-level and the task-level constraints, which perform per-category and whole-distribution alignment, respectively.
Secondly, we propose a data perturbation-based approach to further enhance representation robustness when only few-shot labels are available. Unlike the SSDA problem, the proposed FSDA holds far fewer annotations on the target set to learn from. As a result, a naive training procedure with these few-shot labels would lead to severe overfitting. To this end, we propose to apply a data perturbation technique to alleviate this problem. To better adapt to the distribution of the data, and inspired by previous auto-augmentation work [13], we search for a perturbation function for the semantic segmentation task with the objective that the perturbed candidates should be consistent with the original candidates at the semantic level. Experiments demonstrate that the proposed approach can effectively enhance representation robustness and improve the model’s generalization ability.
We summarize our contributions as follows:
● We define a novel type of domain adaptation, i.e., few-shot domain adaptation (FSDA), which pursues fully supervised performance while minimizing the annotation burden when adapting to new domains. We believe this setting is closer to real-world applications.
● We propose a transferable prototype-based module to align the source and the target domains with only the few-shot labels. Comprehensive experiments demonstrate the effectiveness of our method, especially with few-shot labels.
● We introduce a data perturbation-based approach to help alleviate the overfitting problem when only a few labeled data are available. Experiments demonstrate that our approach can effectively improve representation consistency and model robustness for the domain adaptation problem.

2 Related work

Domain adaptation is a challenging task that focuses on the model transferability [14] to a new data domain. This problem is popular in computer vision for classification [15, 16], object detection [17-19], and segmentation [2, 19, 20]. This work focuses on the challenging domain adaptation for the semantic segmentation task, which requires accurate dense pixel-level predictions while lacking supervision in the target domain. This task is attractive in the field of segmentation to save annotation burdens for various applications [21, 22]. The first work introducing the task is FCNs [2], which applies the adversarial training strategy [12, 23] to a segmentation network with the source and the target domains. Following the adversarial training strategy, numerous methods [5-8, 24] have been explored. The main insight behind these approaches is to align the distributions of image representations from the source and the target domains. They can be roughly categorized into three types according to the adversarial information: feature-level adaptation [2, 7, 8] focuses on the distribution shift in the latent feature space; pixel-level adaptation [6] refers to style transfer to generate indistinguishable predictions for both domains; simultaneous feature-level and pixel-level adaptation [24] aims to tackle the problem by considering both feature- and pixel-level information. Furthermore, MCD [3] proposes a novel method considering task-specific decision boundaries. APL [19] explores an active pseudo-labeling strategy to alleviate the domain gap. DSP [20] proposes an efficient soft-paste strategy for domain adaptive semantic segmentation. CLAN [5] focuses on category shift by aligning each category's distribution. Although these methods have achieved significant improvement, there still exists a large gap between source and target distributions. Consequently, domain adaptation models underperform fully supervised models by a large margin. We notice that in real-world applications, it is generally convenient to obtain few-shot labels in the target scenario. Therefore, we propose few-shot domain adaptation to pursue fully supervised performance while minimizing the annotation burden. Different from semi-supervised domain adaptation [25, 26], which requires many more annotations on the target domain, we take advantage of abundant unlabeled target domain images and only require few-shot target labels as hints.
Data perturbation has been widely studied recently in the field of deep learning, especially for image recognition and object detection tasks. These approaches enhance model performance by providing augmented data and training targets. Typical augmentation operations are random cropping, flipping, rotation, scaling, geometric transformation, image mixing, color jittering, and so on. There are generally two strategies to apply data perturbations. The first group applies manually designed rules to perform the perturbation, such as Mixup [27], Cutout [28], and CutMix [29]. The other group applies automatically searched policies, such as AutoAugment [30], AugDetection [31], and FastAutoAug [13]. Our data perturbation approach follows the second group and applies learned augmentation policies to enhance the model's robustness and generalization ability for the few-shot domain segmentation task. It is noteworthy that previous auto-augmentation methods cannot be directly applied to our semantic segmentation task due to the label misalignment problem caused by geometric transformations of the images. To this end, we generate a specialized data augmentation strategy appropriate for the segmentation task. The adversarial attack strategy [32] is also widely applied to learn robust representations for the original samples. Differently, our data perturbation strategy is more concerned with a data generation process that covers both the source and the target domains.
Few-shot learning aims to transfer the knowledge learned from an abundant labeled training set to novel target classes so that the model can learn quickly from few labeled data on the novel classes. Existing few-shot learning methods can be divided into four types: learning to finetune models, learning recurrent neural networks with memories, network parameter prediction, and metric learning. Our approach is most related to the metric learning-based methods for few-shot classification. In this field, Siamese Networks [33] utilize a two-stream convolutional network to construct metric learning. Matching Networks [34] capture the similarity between one query and multiple support images. Prototypical Networks [35] learn a metric space to compute distances between datapoints and prototype representations of each class. In our few-shot domain adaptation task, there is no category difference between the source and target domains, and the domain discrepancy is the main challenge affecting performance. Therefore, we apply prototypical networks to model category-level domain-invariant feature representations across the source and target domains, utilizing their generalization capability from few-shot images to the entire dataset.

3 Approach

Preliminary We first elaborate on the problem setting of our few-shot domain adaptation (FSDA) task for semantic segmentation. The task contains a source domain dataset, where both images and the corresponding labels are available, and a target domain dataset, where only a few images are coupled with labels and the others are unlabeled. Formally, let $X_s$ and $X_t$ denote the sets of source domain images and target domain images, respectively. We have access to $n_s$ source domain samples, each containing an image $x_s^i$ and the corresponding semantic segmentation label $y_s^i$, i.e., $\mathcal{I}_s=\{(x_s^i, y_s^i)\}_{i=1}^{n_s}$, and a few target domain samples with labels, $\mathcal{I}_t=\{(x_t^i, y_t^i)\}_{i=1}^{n_t}$. Generally, the number of labeled target domain samples is much smaller than that of the source domain, i.e., $n_t \ll n_s$. Additionally, we have access to abundant unlabeled target samples, $\mathcal{I}_u=\{x_t^i\}_{i=1}^{n_u}$. The goal is to employ these data to train and finetune a model to perform the semantic segmentation task on the target domain, and the test data are unseen data in the target domain.

3.1 Overview of the proposed model

Our approach contains three major components. The first one is the adversarial semantic segmentation training procedure, which is widely applied in the field of domain adaptation for semantic segmentation [5, 7]. It contains the main semantic segmentation model $G$ that produces the segmentation results and an additional discriminator $D$ for adversarial training across the domains. It applies both the source and the target domain data with labels, i.e., $\mathcal{I}_s$ and $\mathcal{I}_t$, for the segmentation loss and additionally applies the unlabeled data $\mathcal{I}_u$ for the adversarial loss. The second component is the proposed transferable prototype module, which helps bridge the gap between the source and the target domains with two alignment strategies, i.e., category-level and task-level alignment. The last one is the data perturbation strategy, which alleviates the overfitting problem on the few labeled target samples by adapting a perturbation policy designed for the semantic segmentation task to the given data. We elaborate on the details of these methods in the following sections.

3.2 Adversarial semantic segmentation

Following the common practice in the field of DA for semantic segmentation [5, 7], our method adopts the adversarial training strategy for semantic segmentation. We first introduce the training objective of the adversarial learning. Given the source domain images $X_s$ and unlabeled target domain images $X_u$, the goal is to help the segmentation model $G$ learn domain-invariant predictions from these two domains. This is achieved by adversarial training with an auxiliary discriminator, where the discriminator is trained to distinguish the predictions of the two domains while the segmentation model is trained to fool the discriminator by providing indistinguishable predictions. This procedure is implemented by the min-max game:
$$\mathcal{L}_{adv}(G,D)=\min_{G}\max_{D}\,E(G,D),\qquad E(G,D)=\mathbb{E}\big[\log D(G(x_s))\big]+\mathbb{E}\big[\log\big(1-D(G(x_u))\big)\big], \qquad (1)$$
where $x_s \in X_s$ and $x_u \in X_u$ are the images from the source and the target domains, respectively, and $\mathbb{E}[\cdot]$ stands for the expectation over the corresponding distribution.
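For concreteness, the following is a minimal PyTorch sketch of this adversarial training step. The alternating update scheme, the binary cross-entropy form of the discriminator loss, and the names `adversarial_step`, `opt_G`, and `opt_D` are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G, D, x_s, x_u, opt_G, opt_D):
    """One alternating update of Eq. (1): D learns to separate source from
    target predictions; G learns to make target predictions indistinguishable."""
    # --- update the discriminator D (segmentation outputs are detached) ---
    p_s = torch.softmax(G(x_s), dim=1).detach()   # source-domain prediction maps
    p_u = torch.softmax(G(x_u), dim=1).detach()   # unlabeled target prediction maps
    d_s, d_u = D(p_s), D(p_u)
    loss_D = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
             F.binary_cross_entropy_with_logits(d_u, torch.zeros_like(d_u))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- update the segmentation model G to fool D (D is not stepped here) ---
    p_u = torch.softmax(G(x_u), dim=1)
    d_u = D(p_u)
    loss_adv = F.binary_cross_entropy_with_logits(d_u, torch.ones_like(d_u))
    opt_G.zero_grad()
    loss_adv.backward()
    opt_G.step()
    return loss_D.item(), loss_adv.item()
```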
On the other hand, we introduce the semantic segmentation loss to optimize the segmentation model $G$. The loss is minimized with both the labeled source samples and the few-shot labeled target samples. Given images $x_s \in X_s$ and $x_t \in X_t$ with annotations $y_s$ and $y_t$, we formalize the semantic segmentation loss as:
$$\mathcal{L}_{seg}(G)=\sum_{i=1}^{n_s}\mathcal{L}_{seg}\big(G;(x_s^i,y_s^i)\big)+\sum_{i=1}^{n_t}\mathcal{L}_{seg}\big(G;(x_t^i,y_t^i)\big), \qquad (2)$$
where $\mathcal{L}_{seg}(\cdot,\cdot)$ denotes the multi-category cross-entropy loss. Given $x\in\mathbb{R}^{3\times H\times W}$ and $y\in\mathbb{R}^{C\times H\times W}$, $\mathcal{L}_{seg}(\cdot,\cdot)$ is defined as:
$$\mathcal{L}_{seg}\big(G;(x,y)\big)=-\sum_{i=1}^{H\times W}\sum_{c=1}^{C} y_{i,c}\log p_{i,c}, \qquad (3)$$
where $p_{i,c}$ denotes the predicted probability of class $c$ at pixel $i$, obtained from the segmentation model $G$'s prediction normalized by the Softmax function, and $y_{i,c}$ denotes the corresponding ground truth of class $c$ at pixel $i$.
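Equations (2) and (3) correspond to the standard per-pixel cross-entropy. A short sketch follows, assuming labels are stored as per-pixel class indices with an ignore index for unlabeled pixels (as in Cityscapes-style annotations); the helper name is ours.

```python
import torch.nn.functional as F

def segmentation_loss(G, x, y, ignore_index=255):
    """Eq. (3): pixel-wise multi-class cross-entropy.
    x: (B, 3, H, W) images; y: (B, H, W) integer class labels."""
    logits = G(x)                                  # (B, C, H, W)
    return F.cross_entropy(logits, y, ignore_index=ignore_index)

# Eq. (2) then sums this loss over the labeled source batch and the
# few-shot labeled target batch:
#   L_seg = segmentation_loss(G, x_s, y_s) + segmentation_loss(G, x_t, y_t)
```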

3.3 Adaptive data perturbation

In this section, we introduce how to implement the adaptive data perturbation strategy to enhance model robustness in the few-shot domain adaptation task. As pointed out by previous works [6, 24], the domain shift can be largely attributed to the style difference. The styles are typically represented by color tones, brightness, saturation, and so on. These methods adopt style transfer [6] or pixel-level image generation [24] to address this problem. From the view of style transfer, our adaptive data perturbation approach can also alleviate the style-shifting problem. However, different from previous approaches that narrow down the style space of the source and/or target domains to align them, our approach transfers the data to a larger style space that is expected to cover both the source and the target domains. Therefore, our data perturbation strategy can provide better generalization and avoid the model overfitting to some specific styles.
We adopt the efficient RL-based augmentation method [13] to realize the adaptive data perturbation strategy, which has been successfully demonstrated in image classification tasks. To adapt the approach to segmentation, we design a new search space containing nine operations, excluding operations such as affine transformation and image rotation that change the distributions and structures of segmentation labels. The semantic segmentation augmentation policies are learned with Bayesian hyperparameter optimization. We optimize the parameters on the synthetic dataset with the K-fold method to adapt the perturbation policies to the segmentation task. Then we use HyperOpt [36] to search for the optimal augmentation policies. Once the policies are obtained, we fix them and treat them as perturbation functions to extend the dataset online during the training process.
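To make the search space concrete, the sketch below lists nine color-only operations using PIL and applies a searched sub-policy online, leaving the label map untouched. The magnitude-to-parameter mappings and the example sub-policy are hypothetical placeholders, not the policies actually found by the search described above.

```python
import random
from PIL import Image, ImageOps, ImageEnhance

# Nine color-only operations; none of them moves pixels, so the
# segmentation label map can be left unchanged.
OPS = {
    "AutoContrast": lambda img, m: ImageOps.autocontrast(img),
    "Invert":       lambda img, m: ImageOps.invert(img),
    "Equalize":     lambda img, m: ImageOps.equalize(img),
    "Solarize":     lambda img, m: ImageOps.solarize(img, threshold=int(255 * (1 - m))),
    "Posterize":    lambda img, m: ImageOps.posterize(img, bits=max(1, int(8 * (1 - m)))),
    "Contrast":     lambda img, m: ImageEnhance.Contrast(img).enhance(0.1 + 1.8 * m),
    "Color":        lambda img, m: ImageEnhance.Color(img).enhance(0.1 + 1.8 * m),
    "Brightness":   lambda img, m: ImageEnhance.Brightness(img).enhance(0.1 + 1.8 * m),
    "Sharpness":    lambda img, m: ImageEnhance.Sharpness(img).enhance(0.1 + 1.8 * m),
}

def apply_policy(img, policy):
    """policy: list of (op_name, probability, magnitude in [0, 1]) tuples,
    e.g., as found by the policy search; applied online to the image only."""
    for name, prob, mag in policy:
        if random.random() < prob:
            img = OPS[name](img, mag)
    return img

# hypothetical searched sub-policy and file name, for illustration only
example_policy = [("Equalize", 0.6, 0.0), ("Brightness", 0.4, 0.7)]
augmented = apply_policy(Image.open("city_frame.png").convert("RGB"), example_policy)
```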
As a direct utilization of the data perturbation strategy, we propose to apply the semantic consistency constraint. Intuitively, a well-optimized domain-invariant model should produce predictions invariant to the perturbation of the input image. We leverage this property to train the model as a complementary constraint when there are no ground truth labels for training. Specifically, we take the predictions of the original examples as the target distributions to optimize the perturbed predictions and apply the KL-divergence constraint between these two predictions. Given unlabeled target images $\mathcal{I}_u$, the corresponding perturbed images, obtained by the aforementioned perturbation policies, are denoted as $\mathcal{I}_a$. Then, both $\mathcal{I}_u$ and $\mathcal{I}_a$ are fed into the segmentation model $G$ to produce the segmentation predictions, which are normalized by the Softmax function and denoted by $P_u$ and $P_a$, respectively. Following VAT [37], we define our semantic consistency loss as:
$$\mathcal{L}_{s\_cyc}(G)=D_{KL}\big[P_u(G;x_u)\,\|\,P_a(G;x_a)\big], \qquad (4)$$
where $x_u$ and $x_a$ are drawn from $\mathcal{I}_u$ and $\mathcal{I}_a$, respectively. Gradients are back-propagated through $P_a(G;x_a)$, while $P_u(G;x_u)$ is kept fixed as the target distribution.
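A minimal sketch of Eq. (4) is given below, assuming `G` returns raw logits; averaging the KL term over pixels is our assumption about the normalization.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(G, x_u, x_a):
    """Eq. (4): KL(P_u || P_a) between predictions on the original image x_u
    and its perturbed version x_a. P_u is detached so it serves as a fixed
    target and only the perturbed branch receives gradients."""
    with torch.no_grad():
        p_u = torch.softmax(G(x_u), dim=1)            # (B, C, H, W), fixed target
    log_p_a = torch.log_softmax(G(x_a), dim=1)        # perturbed branch
    c = log_p_a.shape[1]
    # fold spatial positions into the batch so 'batchmean' averages per pixel
    log_p_a = log_p_a.permute(0, 2, 3, 1).reshape(-1, c)
    p_u = p_u.permute(0, 2, 3, 1).reshape(-1, c)
    return F.kl_div(log_p_a, p_u, reduction="batchmean")
```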

3.4 Transferable prototype module

With the aid of the source domain labels and the few-shot labels in the target domain, we can obtain reliable prototypes representative of the corresponding categories. Based on these noiseless prototypes, we propose the transferable prototype module to align both the intermediate features and the final predictions across the two domains. Specifically, a category-level adaptation and a task-level adaptation are proposed. In the category-level adaptation, we directly match the prototypes of the same category obtained from samples in the source and the target domains. As shown in Fig.1, the prototypes of the same category $k$ in the source domain ($c_k$) and the target domain ($\hat{c}_k$) should be close to each other. By means of this, the feature representations of the two domains are implicitly aligned. In the task-level adaptation, we minimize the discrepancy between the predictions of the source and target prototype-based classifiers for unlabeled target samples, which further helps to align the domains through the prediction distributions. We utilize these two components simultaneously to optimize the network $G$.
Fig.1 Source images $\mathcal{I}_s$, labeled target images $\mathcal{I}_t$, unlabeled target images $\mathcal{I}_u$, and perturbed images $\mathcal{I}_a$ are forwarded into the segmentation network $G$. The corresponding latent features are represented by cuboids with different colors. The features obtained from $\mathcal{I}_s$ and $\mathcal{I}_t$ are trained for segmentation, and $\mathcal{I}_s$ and $\mathcal{I}_u$ are used to train the discriminator $D$. Furthermore, $\mathcal{I}_a$ and $\mathcal{I}_u$ construct the semantic consistency constraint, and all the features are used to train the transferable prototypical networks


Transferable prototypes for FSDA We first illustrate how to compute the prototypes and apply them to the prototype-based classifiers. We adapt Prototypical Networks [35] to the scenario of semantic segmentation. Given a pair of an image and its corresponding label $(x, y)$, let $f(G;x)\in\mathbb{R}^{HW\times d}$ denote the feature embedding of the input image $x$ produced by the model $G$, where $d$ is the embedding dimension. The prototype of the $k$th class is computed as:
$$c_k=\frac{1}{N_k}\sum_{i=1}^{HW} f_i(G;x)\,\mathbb{I}[y_i=k], \qquad (5)$$
where $f_i$ denotes the $d$-dimensional embedding of pixel $i$, $y_i$ denotes its label, $\mathbb{I}[\cdot]$ is the indicator function that equals 1 when the condition is true and 0 otherwise, and $N_k=\sum_{i=1}^{HW}\mathbb{I}[y_i=k]$ is the normalization factor.
With the prototypes obtained by Eq. (5), we can utilize them as the classifier weights and further obtain the prototype-based segmentation probabilities:
$$\hat{P}(y_i=k\,|\,x)=\frac{\exp\big(-D(f_i,c_k)\big)}{\sum_{j=1}^{C}\exp\big(-D(f_i,c_j)\big)}, \qquad (6)$$
where $\hat{P}$ represents the prototype-based probabilities, $D(\cdot,\cdot)$ is the Euclidean distance function following [35], which measures the similarity between the per-pixel feature representations and the prototypes, and $C$ is the total number of classes in the dataset.
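The sketch below implements Eqs. (5) and (6) for a single image, assuming the pixel embeddings have been flattened to shape (HW, d); the softmax over negative distances follows the Prototypical Networks formulation [35], and the function names are ours.

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Eq. (5): per-class mean of pixel embeddings.
    features: (HW, d) embeddings f_i(G; x); labels: (HW,) class indices."""
    protos = torch.zeros(num_classes, features.size(1), device=features.device)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():                       # skip classes absent from this image
            protos[k] = features[mask].mean(dim=0)
    return protos

def prototype_probabilities(features, protos):
    """Eq. (6): softmax over negative Euclidean distances to the prototypes,
    following Prototypical Networks [35]; returns (HW, C) probabilities."""
    dists = torch.cdist(features, protos)    # (HW, C) pairwise Euclidean distances
    return torch.softmax(-dists, dim=1)
```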
Category-level adaptation Category shift remains a serious problem in domain adaptation even when the entire distribution of the source domain matches that of the target domain perfectly. Therefore, it is necessary to align category-level distributions. We achieve this goal by minimizing the distances between per-category prototypes of the source and the target domains. In particular, we obtain two kinds of prototypes, based on labeled source images and few-shot target images, respectively:
$$c_k^s=\frac{1}{N_k(y_s)}\sum_{i=1}^{HW} f_i(G;x_s)\,\mathbb{I}[y_i^s=k],\qquad c_k^t=\frac{1}{N_k(y_t)}\sum_{i=1}^{HW} f_i(G;x_t)\,\mathbb{I}[y_i^t=k], \qquad (7)$$
where $N_k(y_s)$ and $N_k(y_t)$ denote the numbers of pixels of category $k$ in the images $x_s$ and $x_t$, respectively.
It should be noted that the prototypes obtained from the few-shot images are biased with respect to the prototypes computed over the entire dataset. To alleviate this problem, we calculate the target prototypes from both the few-shot target images and their data-perturbation-augmented counterparts with labels. To address the tricky case in which the few labeled images contain only a subset of the categories in the source domain, we ignore the categories that are absent from the few-shot target images. In most cases, however, the few-shot target images reasonably contain the major categories and share common objects and spatial layouts. We minimize the distances between prototypes of the same category across domains:
$$\mathcal{L}_{cate}(c_k^s,c_k^t)=\sum_{k=1}^{C}\big\|c_k^s-c_k^t\big\|_2. \qquad (8)$$
Task-level adaptation is further proposed to reduce the domain discrepancy. Given an image, a well-optimized model should produce the same prediction probabilities with both the source and the target prototype-based classifiers. This necessary condition provides a new prototype-based constraint that depends on the task predictions. We implement the constraint by minimizing the KL-divergence between the prediction distributions. Specifically, for an unlabeled target image $x_u$, the predictions obtained from the source and the target prototype classifiers are $\hat{P}^s(G;x_u)$ and $\hat{P}^t(G;x_u)$ according to Eq. (6). Our task-level adaptation is defined as follows:
$$\mathcal{L}_{task}(x_u)=D_{KL}\big[\hat{P}^s(G;x_u)\,\|\,\hat{P}^t(G;x_u)\big]. \qquad (9)$$
Our final loss for the transferable prototype module is:
$$\mathcal{L}_{proto}(G;x_s,x_t,x_u)=\mathcal{L}_{cate}(c_k^s,c_k^t)+\mathcal{L}_{task}(x_u). \qquad (10)$$
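Assuming the prototype helpers from the previous sketch, Eqs. (8)-(10) could be written as follows; representing the absent-class handling as a boolean mask `valid` is our reading of the text, not a detail the paper specifies.

```python
import torch
import torch.nn.functional as F

def category_level_loss(protos_s, protos_t, valid):
    """Eq. (8): L2 distance between same-class prototypes of the two domains.
    valid: boolean mask over classes, True for classes actually present in
    the few-shot target labels (absent classes are ignored)."""
    per_class = (protos_s - protos_t).norm(dim=1)      # (C,) per-class distances
    return per_class[valid].sum()

def task_level_loss(feat_u, protos_s, protos_t):
    """Eq. (9): KL divergence between the source- and target-prototype
    classifiers' predictions on unlabeled target pixel embeddings (HW, d)."""
    p_s = torch.softmax(-torch.cdist(feat_u, protos_s), dim=1)
    log_p_t = torch.log_softmax(-torch.cdist(feat_u, protos_t), dim=1)
    return F.kl_div(log_p_t, p_s, reduction="batchmean")

# Eq. (10): L_proto = category_level_loss(...) + task_level_loss(...)
```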

3.5 Training objective

The overall training objective of our adversarial few-shot domain adaptation for semantic segmentation contains all aforementioned losses:
$$\mathcal{L}=\mathcal{L}_{seg}(G)+\lambda_{adv}\mathcal{L}_{adv}(G,D)+\lambda_{s\_cyc}\mathcal{L}_{s\_cyc}(G)+\lambda_{proto}\mathcal{L}_{proto}(G), \qquad (11)$$
where $\lambda_{adv}$, $\lambda_{s\_cyc}$, and $\lambda_{proto}$ are the weights that balance these losses.
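Putting the pieces together, Eq. (11) is a weighted sum of the four losses. The sketch below uses as defaults the weight values reported later in Section 4.2; the helper name is ours.

```python
def total_loss(l_seg, l_adv, l_s_cyc, l_proto,
               lambda_adv=0.001, lambda_s_cyc=1.0, lambda_proto=0.01):
    """Eq. (11): overall few-shot domain adaptation objective; default
    weights are the values reported in Section 4.2."""
    return l_seg + lambda_adv * l_adv + lambda_s_cyc * l_s_cyc + lambda_proto * l_proto
```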

4 Experiments

4.1 Datasets

We evaluate our method on two popular domain adaptation semantic segmentation tasks: GTA5 [38] → Cityscapes [39] and SYNTHIA [40] → Cityscapes. To the best of our knowledge, these tasks are the standard benchmarks for evaluating domain adaptation semantic segmentation performance. Both GTA5 and SYNTHIA are synthetic datasets generated using computer graphics, so it is convenient and cheap to obtain many samples with precise ground-truth labels. In contrast, the Cityscapes dataset is captured in real-world scenes and is hard to annotate. We apply the synthetic datasets as the source datasets and take the real-world Cityscapes as the target dataset to conduct the adaptation task. GTA5 contains 24,966 images with a resolution of 1914×1024. SYNTHIA consists of 9,400 images with a resolution of 1280×760. Cityscapes is a real-world dataset containing high-quality images of street scenes collected from 18 different cities. It contains 2,975 images for training and 500 images for validation, with a resolution of 2048×1024. We randomly select N samples from each of the 18 cities in Cityscapes for the N-shot domain adaptation task. Consequently, there are 90 labeled images in total in the target set for the 5-shot domain adaptation semantic segmentation. All the remaining training images are taken as unlabeled ones for adaptation.

4.2 Implementation details

We first learn a data perturbation policy for the semantic segmentation problem. Following the algorithm in [13], we search on the GTA5 dataset, and the final perturbation strategy obtained is applied in all the following experiments. To obtain a perturbation strategy compatible with the segmentation task, we only consider color operations in the search space, which do not alter the structure or distribution of the pixel-level labels. Specifically, nine operations are considered, i.e., AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, and Sharpness. These operations are chosen so as not to change the distribution and structure of the segmentation labels. Unlike classification, which only requires global image tags, the segmentation task needs to predict accurate pixel-level labels and relies on positional context; hence, augmentation strategies that significantly change the label structure, such as affine transformation and image rotation, are generally harmful to segmentation. We utilize the DeepLab-v2 [1] framework pre-trained on the ImageNet dataset as our segmentation network, and we follow the hyper-parameters in [13] for training. Briefly, we train the model using the SGD optimizer with a learning rate of $2.5\times10^{-4}$, weight decay of $5\times10^{-4}$, momentum of 0.9, and cosine learning rate decay with one annealing cycle [44].
For our few-shot domain adaptation training, we adopt the DeepLab-v2 framework as the segmentation network $G$ and follow the approach in AdaptSeg [7] as the baseline method. It is noteworthy that the multi-level adversarial strategy is not applied, to save computation memory. To make a fair and comprehensive comparison, we experiment with different backbones, i.e., VGG16 and ResNet-101, to evaluate the proposed approach. For the discriminator network $D$, we construct five convolutional layers with kernel size 4×4, channel numbers {64, 128, 256, 512, 1}, and stride 2. Each layer except the last is followed by a Leaky-ReLU parameterized by 0.2. During training, we use the SGD optimizer for $G$ with a momentum of 0.9 and a weight decay of $10^{-4}$. The initial learning rate is set to $2.5\times10^{-4}$ and decayed by a poly learning rate policy with a power of 0.9. In our experiments, we follow previous work [5] to set $\lambda_{adv}=0.001$, and set $\lambda_{s\_cyc}=1.0$ and $\lambda_{proto}=0.01$ by validation experiments. In addition, we use the Adam optimizer to optimize $D$ with $\beta_1=0.9$ and $\beta_2=0.99$. We initialize its learning rate as $10^{-4}$ and adopt the same poly decay scheduler as for the segmentation network.
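As a concrete reading of the discriminator description above, a PyTorch sketch is given below; padding of 1 and feeding the softmax prediction map as input are our assumptions, not details stated in the paper.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five 4x4 convolutions with stride 2 and channels {64, 128, 256, 512, 1};
    every layer except the last is followed by LeakyReLU(0.2)."""
    def __init__(self, num_classes):
        super().__init__()
        channels = [num_classes, 64, 128, 256, 512, 1]
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                    kernel_size=4, stride=2, padding=1))
            if i < 4:
                layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # x: (B, num_classes, H, W) softmax segmentation prediction map
        return self.model(x)
```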
We use PyTorch [45] for implementation. Performance is evaluated by the Intersection-over-Union (IoU) metric.

4.3 Experimental results

In this section, we present the experimental results on the adaptation tasks GTA5 → Cityscapes and SYNTHIA → Cityscapes, respectively. We evaluate the proposed few-shot domain adaptation method under different settings. We first report source-only results, i.e., models trained on the synthetic dataset GTA5 or SYNTHIA and directly evaluated on the real-world Cityscapes dataset. Furthermore, we compare our method with state-of-the-art UDA methods for semantic segmentation, including CDA [8], Cross-city [41], CyCADA [24], AdaptSeg [7], CLAN [5], and so on. For a fair comparison, we directly report the results from their original papers. We construct the baseline model following the standard UDA method AdaptSeg. Based on the baseline model, we evaluate our proposed semantic consistency constraint ($M_{s\_cyc}$) and transferable prototype module ($M_{proto}$), respectively. Experiments are conducted under the 5-shot setting, and the details of our results are elaborated in the following.
GTA5 to Cityscapes Tab.1 shows the results on the GTA5 → Cityscapes task with comparisons to state-of-the-art methods. From the results, we can conclude that our approach significantly improves the model's adaptation ability with both the VGG16 and the ResNet-101 networks. Specifically, our method outperforms the source-only baseline by 24.2% and 16.6% with VGG16 and ResNet-101, respectively. Meanwhile, compared to state-of-the-art unsupervised domain adaptation methods, our method also brings large improvements of 5.5% and 2.3% points with the VGG16 and ResNet-101 architectures, respectively. In addition, we compare the proposed components to the baseline model in ablation. We observe that both the semantic consistency module and the transferable prototype module are beneficial to the domain adaptation ability: they bring 6.8% and 6.9% improvements with ResNet-101 and 4.4% and 3.9% points with VGG16, respectively. It should be noted that these two components are complementary to each other; applying both of them simultaneously brings 9.2% and 5.7% improvements on ResNet-101 and VGG16, respectively.
SYNTHIA to Cityscapes For the SYNTHIA → Cityscapes task, we consider the 13 categories following [7] for a fair comparison. The results are shown in Tab.2. We observe similar trends as in the GTA5 → Cityscapes experiments. Specifically, our method outperforms the baseline method by 7.7% and 6.9% points with VGG16 and ResNet-101, respectively. Note that our method attains 9.4% and 8.0% gains over CLAN [5] with only 5-shot labeled images. This further demonstrates the generalization ability and effectiveness of our approach.
Tab.1 Results of adaptation from GTA5 to Cityscapes. We first compare with the state-of-the-art UDA algorithms adopting the VGG16 (V) and ResNet-101 (R) networks. Then, we report our results with (s_cyc)/(proto) modules respectively. We highlight the best result in each column in bold
GTA5→Cityscapes
Method Arch. Road Side Build Wall Fence Pole Light Sign Vege Terr Sky Pers Rider Car Truck Bus Train Motor Bike mIoU
Source-only V 26.0 14.9 65.1 5.5 12.9 8.9 6.0 2.5 70.0 2.9 47.0 24.5 0.0 40.0 12.1 1.5 0.0 0.0 0.0 17.9
FCNs [2] V 0.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1
CyCADA [24] V 85.6 30.7 74.7 14.4 13.0 17.6 13.7 5.8 74.6 15.8 69.9 38.2 3.5 72.3 16.0 5.0 0.1 3.6 0.0 29.2
MCD [3] V 86.4 8.5 76.1 18.6 9.7 14.9 7.8 0.6 82.8 32.7 71.4 25.2 1.1 76.3 16.1 17.1 1.4 0.2 0.0 28.8
AdaptSeg [7] V 87.3 29.8 78.6 21.1 18.2 22.5 21.5 11.0 79.7 29.6 71.3 46.8 6.5 80.1 23.0 26.9 0.0 10.6 0.3 35.0
CLAN [5] V 88.0 30.6 79.2 23.4 20.5 26.1 23.0 14.8 81.6 34.5 72.0 45.8 7.9 80.5 26.6 29.9 0.0 10.7 0.0 36.6
Baseline V 93.4 57.6 79.9 23.0 21.3 23.7 15.1 11.7 80.9 37.8 83.5 42.2 9.2 78.4 9.5 0.9 15.4 4.8 3.7 36.4
Ms_cyc V 94.2 62.4 82.5 20.8 30.6 26.9 23.6 22.9 82.3 39.0 87.3 50.5 16.2 79.9 17.7 4.9 11.9 6.6 15.9 40.8
Mproto V 94.4 62.9 82.2 21.4 26.3 27.9 23.8 21.5 84.7 38.5 85.3 51.4 13.9 80.6 14.2 4.1 3.8 3.8 24.0 40.3
Ours (all) V 93.7 58.9 82.7 31.4 28.1 26.8 22.2 22.8 83.5 40.2 86.1 49.0 17.1 78.9 25.4 3.9 20.6 5.8 21.0 42.1
Source-only R 75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 36.0
AdaptSeg [7] R 86.5 25.9 79.8 22.1 20.0 23.6 33.1 21.8 81.8 25.9 75.9 57.3 26.2 76.3 29.8 32.1 7.2 29.5 32.5 41.1
CLAN [5] R 87.0 27.1 79.6 27.3 23.3 28.3 35.5 24.2 83.6 27.4 74.2 58.6 28.0 76.2 33.1 36.7 6.7 31.9 31.4 43.2
MRNet [42] R 89.1 23.9 82.2 19.5 20.1 33.5 42.2 39.1 85.3 33.7 76.4 60.2 33.7 86.0 36.1 43.3 5.9 22.8 30.8 45.5
R-MRNet [43] R 90.4 31.2 85.1 36.9 25.6 37.5 48.8 48.5 85.3 34.8 81.1 64.4 36.8 86.3 34.9 52.2 1.7 29.0 44.6 50.3
Baseline R 93.8 59.4 79.9 21.5 19.9 26.2 22.9 18.9 83.5 40.7 84.7 58.3 25.6 86.1 37.6 39.8 3.7 11.3 10.2 43.4
Ms_cyc R 95.2 67.6 85.0 27.0 30.5 33.0 38.2 47.8 86.6 44.3 85.9 60.3 33.8 86.7 20.6 14.9 24.2 15.7 56.4 50.2
Mproto R 95.2 65.2 85.1 26.4 30.5 34.1 39.1 48.7 86.5 46.4 86.0 62.2 35.2 85.4 8.75 10.4 25.5 24.0 58.4 50.3
Ours (all) R 95.6 68.8 85.6 27.6 35.6 35.4 40.2 45.2 88.3 46.5 87.6 61.3 36.5 86.3 30.8 10.2 32.7 22.4 57.2 52.6
Tab.2 Results of adaptation from SYNTHIA to Cityscapes. We first compare with the state-of-the-art UDA algorithms adopting the VGG16 (V) and ResNet-101 (R) networks. Then we report our results with (s_cyc)/(proto) modules respectively. We highlight the best result in each column in bold
SYNTHIA → Cityscapes
Method Arch. Road Side. Build. Light Sign Vege. Sky Pers. Rider Car Bus Motor Bike mIoU
Source-only V 6.4 17.7 29.7 0.0 7.2 30.3 66.8 51.5 1.5 47.3 3.9 0.1 0.0 20.2
FCNs [2] V 11.5 19.6 30.8 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 22.9
CDA [8] V 65.2 26.1 74.9 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 34.8
Cross-city [41] V 62.7 25.6 78.3 1.2 5.4 81.3 81.0 37.4 6.4 63.5 16.1 1.2 4.6 35.7
AdaptSeg [7] V 78.9 29.2 75.5 0.1 4.8 72.6 76.7 43.4 8.8 71.1 16.0 3.6 8.4 37.6
CLAN [5] V 80.4 30.7 74.7 1.4 8.0 77.1 79.0 46.5 8.9 73.8 18.2 2.2 9.9 39.3
Baseline V 89.8 43.6 73.1 2.3 19.1 79.4 77.5 43.8 7.7 74.8 6.5 0.7 15.2 41.0
Ms_cyc V 93.7 56.2 79.6 5.7 16.3 80.4 85.0 47.8 11.4 78.6 6.6 7.1 22.0 45.4
Mproto V 92.8 54.2 78.7 6.1 12.8 81.1 83.5 47.9 9.6 76.6 3.6 9.5 28.0 44.9
Ours V 94.7 60.7 82.6 5.5 19.7 84.3 85.6 52.9 10.7 80.2 9.1 10.2 36.7 48.7
Source-only R 55.6 23.8 74.6 6.1 12.1 74.8 79.0 55.3 19.1 39.6 23.3 13.7 25.0 38.6
AdaptSeg [7] R 79.2 37.2 78.8 9.9 10.5 78.2 80.5 53.5 19.6 67.0 29.5 21.6 31.3 45.9
CLAN [5] R 81.3 37.0 80.1 16.1 13.7 78.2 81.5 53.4 21.2 73.0 32.9 22.6 30.7 47.8
MRNet [42] R 82.0 36.5 80.4 18.0 13.4 81.1 80.8 61.3 21.7 84.4 32.4 14.8 45.7 50.2
R-MRNet [43] R 87.6 41.9 83.1 31.3 19.9 81.6 80.6 63.0 21.8 86.2 40.7 23.6 53.1 54.9
Baseline R 88.1 42.4 79.9 16.4 21.8 80.0 77.1 57.6 24.6 75.5 20.0 11.2 40.5 48.9
Ms_cyc R 94.5 61.3 83.4 16.9 24.0 84.8 88.2 61.6 21.9 84.1 27.8 7.1 49.4 54.2
Mproto R 93.4 56.9 82.7 7.2 27.6 83.5 86.8 60.6 24.0 82.0 22.0 11.5 46.7 52.7
Ours R 93.4 57.5 83.2 18.3 29.0 83.9 87.3 60.1 30.2 83.6 38.3 11.3 49.3 55.8
Tab.3 Comparison of our approach with semi-supervised approaches using the ResNet-101 backbone on the Cityscapes (CS) dataset
Method N=375
Train on CS 55.1
Hung et al. [9] 57.1
Tarun et al. [10] 55.9
Mittal et al. [11] 59.3
Ours 59.8
Comparison to semi-supervised domain adaptation In addition to demonstrating the superiority of the proposed method over unsupervised domain adaptation, we also compare it with semi-supervised semantic segmentation works. For a fair comparison, we sample a similar amount of labeled target data in Tab.3. For the setting of N=375, which corresponds to using 375 labeled examples from Cityscapes, our approach achieves the best mIoU of 59.8% with ResNet-101, demonstrating its effectiveness over these semi-supervised approaches.

4.4 Ablation studies

Effects of each component In this section, we analyze the effectiveness of the proposed modules, including the data perturbation-based semantic consistency constraint ($M_{s\_cyc}$) and the category-level transferable prototype module ($M_{proto}$). In addition, we also evaluate the effectiveness of the proposed category-level adaptation ($M_{cate}$) and task-level adaptation ($M_{task}$) within the prototype module. Experiments are conducted on the GTA5 → Cityscapes task. We construct the aforementioned baseline model following AdaptSeg [7] under the proposed few-shot domain adaptation setting. As shown in Tab.4, all proposed components bring observable improvements over the baseline. Specifically, the semantic consistency module benefits global adaptation by forming more robust representations with the data perturbation strategy. The prototype module facilitates adaptation on small objects, e.g., traffic lights, traffic signs, and poles, because it helps to learn more discriminative category boundaries. Moreover, from Tab.4 we observe that the task-level adaptation is more powerful than the category-level adaptation, outperforming the baseline model by 5.4% points on ResNet-101, while the category-level adaptation improves it by 2.0%. Nonetheless, adopting both of them simultaneously brings the best result with a 6.9% improvement.
Tab.4 Ablation studies of proposed modules
GTA5→Cityscapes
Method VGG ResNet
Baseline 36.4 43.4
Ms_cyc 40.8 50.2
Mcate 38.3 45.4
Mtask 39.2 48.8
Mproto 40.3 50.3
Ours 42.1 52.6
Category-level transferable prototypical networks analysis We provide a qualitative analysis in this section to demonstrate that the transferable prototype module can effectively align the category-level distributions between the source and the target domains. We first present colored prediction results of randomly sampled images from the test set, as shown in Fig.2. We can observe that our approach yields better segmentation outputs with correct predictions on small objects, e.g., pole and traffic light. We then use the t-SNE approach [46] to visualize the feature distributions obtained by different methods in Fig.3. For clarity, we only show four categories, i.e., building in blue, pole in orange, traffic sign in red, and traffic light in green. Compared to the baseline method, our approach aligns the category-level representations better, and different categories are more separable with larger metric distances.
Fig.2 Visualization of prediction results of different methods


Fig.3 Visualization of the t-SNE results of different methods


Data perturbation-based semantic consistency To further demonstrate the effectiveness of the proposed semantic consistency constraint ($M_{s\_cyc}$), we provide the evaluation results at different training epochs in Fig.4. We compare the baseline method, the proposed consistency module, and the method with both the consistency constraint and the transferable prototype module. The performance of the baseline method without semantic consistency degrades after around epoch 40. When applying $M_{s\_cyc}$, the accuracy of the proposed method continues to increase with a significant improvement. Moreover, our final method follows the same trend and achieves higher performance. We also observe that over longer training (up to 150 epochs), our approach only fluctuates near the best results within a relatively small range of ±0.9%, while the baseline approach fluctuates much more severely, between 21% and 30%. These results indicate that the proposed data perturbation-based semantic consistency constraint can effectively enhance the generalization and robustness of models while overcoming the overfitting problem caused by the few labeled samples.
Fig.4 Evaluation of the performance at different training epochs. The data perturbation-based semantic consistency constraint effectively alleviates the overfitting problem


N-shot analysis We also evaluate N-shot scenarios to provide insight into how the performance changes with different numbers of annotated target images. Specifically, we report the 1-shot, 2-shot, 3-shot, 5-shot, and 10-shot results in Tab.5. It is noteworthy that even in the extreme 1-shot case, our approach achieves 46.9% mIoU with ResNet-101, outperforming many UDA methods such as AdaptSeg [7] and CLAN [5] by a large margin. Moreover, our 10-shot model achieves performance comparable to the fully supervised setting in some categories, such as road, building, and vegetation, with accuracies of 95.6% vs. 96.4%, 83.8% vs. 86.6%, and 85.4% vs. 87.1% based on the VGG16 network. These results demonstrate that our approach is effective in balancing the annotation burden and the model performance and is beneficial to practical domain adaptation applications.
Tab.5 Results of N-shot target samples
GTA5→Cityscapes
Method VGG ResNet
1-shot 36.8 46.9
2-shot 37.6 48.8
3-shot 38.6 49.7
5-shot 42.1 52.6
10-shot 48.6 56.5
Full 58.5 65.1
Data perturbation results The semantic consistency constraint relies on the data perturbation strategy for learning. The data perturbation function helps learn more robust representations by providing more diverse samples in an expanded style space. We inspect the data perturbation results on the Cityscapes dataset in Fig.5 and confirm that it provides more style variation while maintaining the content structure of the original images.
Fig.5 Visualization of data perturbation for Cityscapes. The first row indicates the original images, and the second row indicates the augmented images


5 Conclusions

In this paper, we propose a novel few-shot domain adaptation problem for the semantic segmentation task, which leverages few-shot labeled target domain images for training. To effectively employ the few-shot labels, we propose a data perturbation-based strategy and a transferable prototype module, which help to learn more robust representations and align the cross-domain features. We conducted detailed experiments to analyze the proposed approach and demonstrated its effectiveness in utilizing the few-shot labels for domain adaptation. In particular, we show that the proposed approach is effective in balancing the annotation burden and model performance, demonstrating its value in practical domain adaptation applications.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (2019QY1604), the Major Project for New Generation of AI (2018AAA0100400), the National Youth Talent Support Program, and the National Natural Science Foundation of China (Grant Nos. U21B2042, 62006231, and 62072457).
1
Chen L C , Papandreou G , Kokkinos I , Murphy K , Yuille A L . Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40( 4): 834– 848

2
Hoffman J Wang D Yu F Darrell T. FCNs in the wild: pixel-level adversarial and constraint-based adaptation. 2016, arXiv preprint arXiv: 1612.02649

3
Saito K Watanabe K Ushiku Y Harada T. Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3723– 3732

4
Wang Y Peng J Zhang Z. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 9072– 9081

5
Luo Y Zheng L Guan T Yu J Yang Y. Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2502– 2511

6
Zhang Y Qiu Z Yao T Liu D Mei T. Fully convolutional adaptation networks for semantic segmentation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 6810– 6818

7
Tsai Y H Hung W C Schulter S Sohn K Yang M H Chandraker M. Learning to adapt structured output space for semantic segmentation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7472– 7481

8
Zhang Y David P Gong B. Curriculum domain adaptation for semantic segmentation of urban scenes. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 2039– 2049

9
Hung W C Tsai Y H Liou Y T Lin Y Y Yang M H. Adversarial learning for semi-supervised semantic segmentation. In: Proceedings of British Machine Vision Conference 2018. 2018, 65

10
Kalluri T Varma G Chandraker M Jawahar C V. Universal semi-supervised semantic segmentation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 5258– 5269

11
Mittal S Tatarchenko M Brox T. Semi-supervised semantic segmentation with high- and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(4): 1369– 1379

12
Goodfellow I J Pouget-Abadie J Mirza M Xu B Warde-Farley D Ozair S Courville A Bengio Y. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014, 2672– 2680

13
Lim S Kim I Kim T Kim C Kim S. Fast autoaugment. In: Proceedings of Neural Information Processing Systems 32. 2019, 6662– 6672

14
Liu T Yang Q Tao D. Understanding how feature structure transfers in transfer learning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017, 2365– 2371

15
Ge P Ren C X Dai D Q Yan H. Domain adaptation and image classification via deep conditional adaptation network. 2020, arXiv preprint arXiv: 2006.07776

16
Wittich D , Rottensteiner F . Appearance based deep domain adaptation for the classification of aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 180: 82– 102

17
He Z Zhang L. Multi-adversarial faster-RCNN for unrestricted object detection. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6667– 6676

18
He Z Zhang L. Domain adaptive object detection via asymmetric tri-way faster-RCNN. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 309– 324

19
Song L , Xu Y , Zhang L , Du B , Zhang Q , Wang X . Learning from synthetic images via active pseudo-labeling. IEEE Transactions on Image Processing, 2020, 29: 6452– 6465

20
Gao L Zhang J Zhang L Tao D. DSP: dual soft-paste for unsupervised domain adaptive semantic segmentation. In: Proceedings of the 29th ACM International Conference on Multimedia. 2021, 2825– 2833

21
Pape C , Matskevych A , Wolny A , Hennies J , Mizzon G , Louveaux M , Musser J , Maizel A , Arendt D , Kreshuk A . Leveraging domain knowledge to improve microscopy image segmentation with lifted multicuts. Frontiers in Computer Science, 2019, 1: 6

22
Quan T M , Hildebrand D G C , Jeong W K . Fusionnet: a deep fully residual convolutional neural network for image segmentation in connectomics. Frontiers in Computer Science, 2021, 3: 613981

23
Baniukiewicz P , Lutton E J , Collier S , Bretschneider T . Generative adversarial networks for augmenting training data of microscopic cell images. Frontiers in Computer Science, 2019, 1: 10

24
Hoffman J Tzeng E Park T Zhu J Y Isola P Saenko K Efros A A Darrell T. CyCADA: cycle-consistent adversarial domain adaptation. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 1989– 1998

25
Yao T Pan Y Ngo C W Li H Mei T. Semi-supervised domain adaptation with subspace learning for visual recognition. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 2142– 2150

26
Saito K Kim D Sclaroff S Darrell T Saenko K. Semi-supervised domain adaptation via minimax entropy. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 8049– 8057

27
Zhang H Cisse M Dauphin Y N Lopez-Paz D. Mixup: beyond empirical risk minimization. In: Proceedings of the 6th International Conference on Learning Representations. 2018

28
DeVries T Taylor G W. Improved regularization of convolutional neural networks with cutout. 2017, arXiv preprint arXiv: 1708.04552

29
Yun S Han D Chun S Oh S J Yoo Y Choe J. CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6022– 6031

30
Cubuk E D Zoph B Mane D Vasudevan V Le Q V. Autoaugment: learning augmentation policies from data. 2018, arXiv preprint arXiv: 1805.09501

31
Zoph B Cubuk E D Ghiasi G Lin T Y Shlens J Le Q V. Learning data augmentation strategies for object detection. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 566– 583

32
Zhang L Zhou Y Zhang L. On the robustness of domain adaption to adversarial attacks. 2021, arXiv preprint arXiv: 2108.01807

33
Koch G Zemel R Salakhutdinov R. Siamese neural networks for one-shot image recognition. In: Proceedings of the 32nd International Conference on Machine Learning. 2015

34
Vinyals O Blundell C Lillicrap T Wierstra D. Matching networks for one shot learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 3637– 3645

35
Snell J Swersky K Zemel R. Prototypical networks for few-shot learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 4077– 4087

36
Bergstra J S Bardenet R Bengio Y Kégl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 2546– 2554

37
Miyato T , Maeda S I , Koyama M , Ishii S . Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41( 8): 1979– 1993

38
Richter S R Vineet V Roth S Koltun V. Playing for data: Ground truth from computer games. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 102– 118

39
Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson R Franke U Roth S Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213– 3223

40
Ros G Sellart L Materzynska J Vazquez D Lopez A M. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3234– 3243

41
Chen Y H Chen W Y Chen Y T Tsai B C Wang Y C F Sun M. No more discrimination: cross city adaptation of road scene segmenters. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 2011– 2020

42
Zheng Z Yang Y. Unsupervised scene adaptation with memory regularization in vivo. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2021, 150

43
Zheng Z , Yang Y . Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, 2021, 129( 4): 1106– 1120

44
Loshchilov I Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2016

45
Paszke A Gross S Chintala S Chanan G Yang E DeVito Z Lin Z Desmaison A Antiga L Lerer A. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017

46
van der Maaten L , Hinton G . Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9( 86): 2579– 2605
