RESEARCH ARTICLE

Toward few-shot domain adaptation with perturbation-invariant representation and transferable prototypes

  • Junsong FAN 1,2,
  • Yuxi WANG 1,2,3,
  • He GUAN 1,2,
  • Chunfeng SONG 1,2,
  • Zhaoxiang ZHANG 1,2,3
  • 1. Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • 2. University of Chinese Academy of Sciences, Beijing 100049, China
  • 3. Centre for Artificial Intelligence and Robotics, HKISI_CAS, Hong Kong 999077, China

Received date: 08 Jan 2022

Accepted date: 09 Mar 2022

Published date: 15 Jun 2022

Copyright

2022 Higher Education Press

Abstract

Domain adaptation (DA) for semantic segmentation aims to reduce the annotation burden of the dense pixel-level prediction task. It focuses on tackling the domain gap problem and manages to transfer knowledge learned from abundant source data to new target scenes. Although recent works have achieved rapid progress in this field, they still underperform fully supervised models by a large margin due to the absence of any available hints in the target domain. Considering that few-shot labels are cheap to obtain in practical applications, we attempt to leverage them to mitigate the performance gap between DA and fully supervised methods. The key to this problem is to effectively leverage the few-shot labels to learn robust domain-invariant predictions. To this end, we first design a data perturbation strategy to enhance the robustness of the representations. Furthermore, a transferable prototype module is proposed to bridge the domain gap based on the source data and the few-shot targets. By means of these proposed methods, our approach can, to some extent, perform on par with fully supervised models. We conduct extensive experiments to demonstrate the effectiveness of the proposed methods and report state-of-the-art performance on two popular DA tasks, i.e., from GTA5 to Cityscapes and from SYNTHIA to Cityscapes.

Cite this article

Junsong FAN , Yuxi WANG , He GUAN , Chunfeng SONG , Zhaoxiang ZHANG . Toward few-shot domain adaptation with perturbation-invariant representation and transferable prototypes[J]. Frontiers of Computer Science, 2022 , 16(3) : 163347 . DOI: 10.1007/s11704-022-2015-7

1 Introduction

With the help of large-scale datasets, deep learning-based approaches have shown unprecedented advances in many applications. However, the acquisition of well-labeled large-scale datasets is time-consuming and expensive, limiting the application of these advanced deep models. This problem is even more severe in dense pixel-wise tasks such as semantic segmentation, where full annotations require precise per-pixel class discrimination [1]. Meanwhile, models trained on existing source scenes generally perform poorly when generalized to new target scenes due to the well-known domain gap problem. Thus, it is necessary to finetune the model on the target domain with some data. To achieve acceptable performance on the target scenes while minimizing the data annotation burden, a natural idea is adapting the model from the labeled source domain to the unlabeled target domain, i.e., unsupervised domain adaptation (UDA) for semantic segmentation [2-5].
The UDA approach is developed to learn domain-invariant representations and improve the model’s performance on the target domain without knowing target domain labels. Most UDA methods tackle this task by matching different levels of distributions between the source and the target domains to suppress the domain shift, such as pixel-level [6] and feature-level [2, 7, 8] distributions. However, due to the absence of supervision on the target domain, there is still a significant performance gap between state-of-the-art UDA and fully supervised scenarios, especially for pixel-level semantic segmentation tasks. Semi-supervised domain adaptation (SSDA) [9-11] alleviates this problem by using a proportion of labeled data in the target domain. However, the required annotation amount is generally large, significantly increasing the annotation burden.
Alternatively, in this work, we aim to imitate the human ability to learn from very little supervision so as to improve domain adaptation performance while minimizing the annotation burden. To this end, we propose to conduct few-shot domain adaptation (FSDA) for semantic segmentation, where only a fixed, small number of labels are available in the target domain. Compared to earlier SSDA, the proposed FSDA uses far fewer labeled samples for training, which are much easier to obtain. However, the FSDA model faces severe challenges: overfitting to the very few target samples, class imbalance, and classes that are absent from the target labels. Compared to UDA, with the help of the few labels, the representations of each category can be reliably aligned across the source and the target domains to achieve a consistent representation space. As a result, the models’ performance can be significantly improved.
We focus our design on effectively leveraging the few-shot labels on the target domain to help mitigate the domain gap. To this end, in addition to the common semantic segmentation loss [1] and adversarial loss [12] for DA, we propose two main components for the FSDA problem. Firstly, we propose a transferable prototype-based module to align the source and the target domains with the help of few-shot labels on the target set. Unlike instance-level or pixel-level representations, which are vulnerable, the prototypes are usually more robust and stable because they aggregate context across different samples. Thanks to the few-shot labels, the representations of each category can be reliably expressed with the prototypes and then aligned across domains. We design two kinds of prototype transferring objectives to align the domains, i.e., the category-level and the task-level constraints, which perform per-category and whole-distribution alignment, respectively.
Secondly, we propose a data perturbation-based approach to further enhance representation robustness when only few-shot labels are available. Unlike the SSDA problem, the proposed FSDA holds far fewer annotations on the target set to learn from. As a result, a naive training procedure with these few-shot labels would lead to severe overfitting. To this end, we propose to apply a data perturbation technique to alleviate this problem. To better adapt to the distribution of the data, and inspired by previous auto-augmentation work [13], we search for a perturbation function for the semantic segmentation task with the objective that the perturbed candidates should be consistent with the original candidates at the semantic level. Experiments demonstrate that the proposed approach can effectively enhance representation robustness and improve the model’s generalization ability.
We summarize our contributions as follows:
● We define a novel type of domain adaptation, i.e., few-shot domain adaptation (FSDA), which pursues fully supervised performance while minimizing the annotation burden when adapting to new domains. We believe this setting is closer to real-world applications.
● We propose a transferable prototype-based module to align the source and the target domains with only the few-shot labels. Comprehensive experiments demonstrate the effectiveness of our method, especially with few-shot labels.
● We introduce a data perturbation-based approach to help alleviate the overfitting problem when only a few labeled data are available. Experiments demonstrate that our approach can effectively improve representation consistency and model robustness for the domain adaptation problem.

2 Related work

Domain adaptation is a challenging task that focuses on the model transferability [14] to a new data domain. This problem is popular in computer vision for classification [15, 16], object detection [17-19], and segmentation [2, 19, 20]. This work focuses on the challenging domain adaptation for the semantic segmentation task, which requires accurate dense pixel-level predictions while lacking supervision in the target domain. This task is attractive in the field of segmentation to save annotation burdens for various applications [21, 22]. The first work introducing the task is FCNs [2], which applies the adversarial training strategy [12, 23] to a segmentation network with the source and the target domains. Following the adversarial training strategy, numerous methods [5-8, 24] have been explored. The main insight behind these approaches is to align the distributions of image representations from the source and the target domains. They can be roughly categorized into three types according to the adversarial information: feature-level adaptation [2, 7, 8] focuses on the distribution shift in the latent feature space; pixel-level adaptation [6] refers to style transfer to generate indistinguishable predictions for both domains; simultaneous feature-level and pixel-level adaptation [24] aims to tackle the problem by considering both feature- and pixel-level information. Furthermore, MCD [3] proposes a novel method considering task-specific decision boundaries. APL [19] explores an active pseudo-labeling strategy to alleviate the domain gap. DSP [20] proposes an efficient soft-paste strategy for domain adaptive semantic segmentation. CLAN [5] focuses on category shift by aligning each category's distribution. Although these methods have achieved significant improvement, there still exists a large gap between source and target distributions. Consequently, domain adaptation models underperform fully supervised models by a large margin. We notice that in real-world applications, it is generally convenient to obtain few-shot labels in the target scenario. Therefore, we propose few-shot domain adaptation to pursue fully supervised performance while minimizing the annotation burden. Different from semi-supervised domain adaptation [25, 26], which requires many more annotations on the target domain, we take advantage of abundant unlabeled target domain images and only require few-shot target labels as hints.
Data perturbation has been widely studied recently in the field of deep learning, especially for image recognition and object detection tasks. These approaches enhance model performance by providing augmented data and training targets. Typical augmentation operations are random cropping, flipping, rotation, scaling, geometric transformation, image mixing, color jittering, and so on. There are generally two strategies to apply data perturbations. The first group applies manually designed rules to perform the perturbation, such as Mixup [27], Cutout [28], and CutMix [29]. The other group applies automatically searched policies, such as AutoAugment [30], AugDetection [31], and FastAutoAug [13]. Our data perturbation approach follows the second group and applies learned augmentation policies to enhance the model's robustness and generalization ability for the few-shot domain segmentation task. It is noteworthy that previous auto-augmentation methods cannot be directly applied to our semantic segmentation task due to the label misalignment problem caused by geometric transformations of the images. To this end, we generate a specialized data augmentation strategy appropriate for the segmentation task. The adversarial attack strategy [32] is also widely applied to learn robust representations for the original samples. Differently, our data perturbation strategy is more concerned with a data generation process that covers both the source and the target domains.
Few-shot learning aims to transfer the knowledge learned from an abundant labeled training set to novel target classes so that the model can learn quickly from few labeled data on the novel classes. Existing few-shot learning methods can be divided into four types: learning to finetune models, learning recurrent neural networks with memories, network parameter prediction, and metric learning. Our approach is most related to the metric learning-based methods for few-shot classification. In this field, Siamese Networks [33] utilize a two-stream convolutional network to construct metric learning. Matching Networks [34] capture the similarity between one query and multiple support images. Prototypical Networks [35] learn a metric space to compute distances between datapoints and prototype representations of each class. In our few-shot domain adaptation task, there is no category difference between the source and target domains, and the domain discrepancy is the main challenge affecting performance. Therefore, we apply prototypical networks to model category-level domain-invariant feature representations across the source and target domains, utilizing their generalization capability from few-shot images to the entire dataset.

3 Approach

Preliminary We first elaborate on the problem setting of our few-shot domain adaptation (FSDA) task for semantic segmentation. The task contains a source domain dataset, where both images and the corresponding labels are available, and a target domain dataset, where only a few images are coupled with labels and the others are unlabeled. Formally, let $X_s$ and $X_t$ denote the sets of source domain images and target domain images, respectively. We have access to $n_s$ source domain samples, each containing an image $x_s^i$ and the corresponding semantic segmentation label $y_s^i$, i.e., $\mathcal{I}_s=\{(x_s^i, y_s^i)\}_{i=1}^{n_s}$, and a few target domain samples with labels, $\mathcal{I}_t=\{(x_t^i, y_t^i)\}_{i=1}^{n_t}$. Generally, the number of labeled target domain samples is much smaller than that of the source domain, i.e., $n_t \ll n_s$. Additionally, we have access to abundant unlabeled target samples, $\mathcal{I}_u=\{x_t^i\}_{i=1}^{n_u}$. The goal is to employ these data to train and finetune a model to perform the semantic segmentation task on the target domain, and the test data are unseen data in the target domain.

3.1 Overview of the proposed model

Our approach contains three major components. The first one is the adversarial semantic segmentation training procedure, which is widely applied in the field of domain adaptation for semantic segmentation [5, 7]. It contains the main semantic segmentation model $G$ that produces the segmentation results and an additional discriminator $D$ for adversarial training across the domains. It applies both the source and the target domain data with labels, i.e., $\mathcal{I}_s$ and $\mathcal{I}_t$, for the segmentation loss and additionally applies the unlabeled data $\mathcal{I}_u$ for the adversarial loss. The second component is the proposed transferable prototype module, which helps bridge the gap between the source and the target domains with two alignment strategies, i.e., category-level and task-level alignment. The last one is the data perturbation strategy, which alleviates the overfitting problem on the few labeled target samples by adapting a perturbation policy designed for the semantic segmentation task to the given data. We elaborate on the details of these methods in the following sections.

3.2 Adversarial semantic segmentation

Following the common practice in the field of DA for semantic segmentation [5, 7], our method adopts the adversarial training strategy for semantic segmentation. We first introduce the training objective of the adversarial learning. Given the source domain images $X_s$ and unlabeled target domain images $X_u$, the goal is to help the segmentation model $G$ learn domain-invariant predictions from these two domains. This is achieved by adversarial training with an auxiliary discriminator, where the discriminator is trained to distinguish the predictions of the two domains while the segmentation model is trained to fool the discriminator by providing indistinguishable predictions. This procedure is implemented by the min-max game:
$$\mathcal{L}_{adv}(G,D)=\min_{G}\max_{D}\,E(G,D),\qquad E(G,D)=\mathbb{E}\big[\log D(G(x_s))\big]+\mathbb{E}\big[\log\big(1-D(G(x_u))\big)\big], \qquad (1)$$
where $x_s \in X_s$ and $x_u \in X_u$ are the images from the source and the target domains, respectively, and $\mathbb{E}[\cdot]$ stands for the expectation over the corresponding distribution.
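For concreteness, the following is a minimal PyTorch sketch of this adversarial training step. The alternating update scheme, the binary cross-entropy form of the discriminator loss, and the names `adversarial_step`, `opt_G`, and `opt_D` are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_step(G, D, x_s, x_u, opt_G, opt_D):
    """One alternating update of Eq. (1): D learns to separate source from
    target predictions; G learns to make target predictions indistinguishable."""
    # --- update the discriminator D (segmentation outputs are detached) ---
    p_s = torch.softmax(G(x_s), dim=1).detach()   # source-domain prediction maps
    p_u = torch.softmax(G(x_u), dim=1).detach()   # unlabeled target prediction maps
    d_s, d_u = D(p_s), D(p_u)
    loss_D = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
             F.binary_cross_entropy_with_logits(d_u, torch.zeros_like(d_u))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- update the segmentation model G to fool D (D is not stepped here) ---
    p_u = torch.softmax(G(x_u), dim=1)
    d_u = D(p_u)
    loss_adv = F.binary_cross_entropy_with_logits(d_u, torch.ones_like(d_u))
    opt_G.zero_grad()
    loss_adv.backward()
    opt_G.step()
    return loss_D.item(), loss_adv.item()
```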
On the other hand, we introduce the semantic segmentation loss to optimize the segmentation model $G$. The loss is minimized with both the labeled source samples and the few-shot labeled target samples. Given images $x_s \in X_s$ and $x_t \in X_t$ with annotations $y_s$ and $y_t$, we formalize the semantic segmentation loss as:
$$\mathcal{L}_{seg}(G)=\sum_{i=1}^{n_s}\mathcal{L}_{seg}\big(G;(x_s^i,y_s^i)\big)+\sum_{i=1}^{n_t}\mathcal{L}_{seg}\big(G;(x_t^i,y_t^i)\big), \qquad (2)$$
where $\mathcal{L}_{seg}(\cdot,\cdot)$ denotes the multi-category cross-entropy loss. Given $x\in\mathbb{R}^{3\times H\times W}$ and $y\in\mathbb{R}^{C\times H\times W}$, $\mathcal{L}_{seg}(\cdot,\cdot)$ is defined as:
$$\mathcal{L}_{seg}\big(G;(x,y)\big)=-\sum_{i=1}^{H\times W}\sum_{c=1}^{C} y_{i,c}\log p_{i,c}, \qquad (3)$$
where $p_{i,c}$ denotes the predicted probability of class $c$ at pixel $i$, obtained from the segmentation model $G$'s prediction normalized by the Softmax function, and $y_{i,c}$ denotes the corresponding ground truth of class $c$ at pixel $i$.
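Equations (2) and (3) correspond to the standard per-pixel cross-entropy. A short sketch follows, assuming labels are stored as per-pixel class indices with an ignore index for unlabeled pixels (as in Cityscapes-style annotations); the helper name is ours.

```python
import torch.nn.functional as F

def segmentation_loss(G, x, y, ignore_index=255):
    """Eq. (3): pixel-wise multi-class cross-entropy.
    x: (B, 3, H, W) images; y: (B, H, W) integer class labels."""
    logits = G(x)                                  # (B, C, H, W)
    return F.cross_entropy(logits, y, ignore_index=ignore_index)

# Eq. (2) then sums this loss over the labeled source batch and the
# few-shot labeled target batch:
#   L_seg = segmentation_loss(G, x_s, y_s) + segmentation_loss(G, x_t, y_t)
```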

3.3 Adaptive data perturbation

In this section, we introduce how to implement the adaptive data perturbation strategy to enhance model robustness in the few-shot domain adaptation task. As pointed out by previous works [6, 24], the domain shift can be largely attributed to the style difference. The styles are typically represented by color tones, brightness, saturation, and so on. These methods adopt style transfer [6] or pixel-level image generation [24] to address this problem. From the view of style transfer, our adaptive data perturbation approach can also alleviate the style-shifting problem. However, different from previous approaches that narrow down the style space of the source and/or target domains to align them, our approach transfers the data to a larger style space that is expected to cover both the source and the target domains. Therefore, our data perturbation strategy can provide better generalization and avoid the model overfitting to some specific styles.
We adopt the efficient RL-based augmentation method [13] to realize the adaptive data perturbation strategy, which has been successfully demonstrated in image classification tasks. To adapt the approach to segmentation, we design a new search space containing nine operations, excluding operations such as affine transformation and image rotation that change the distributions and structures of segmentation labels. The semantic segmentation augmentation policies are learned with Bayesian hyperparameter optimization. We optimize the parameters on the synthetic dataset with the K-fold method to adapt the perturbation policies to the segmentation task. Then we use HyperOpt [36] to search for the optimal augmentation policies. Once the policies are obtained, we fix them and treat them as perturbation functions to extend the dataset online during the training process.
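To make the search space concrete, the sketch below lists nine color-only operations using PIL and applies a searched sub-policy online, leaving the label map untouched. The magnitude-to-parameter mappings and the example sub-policy are hypothetical placeholders, not the policies actually found by the search described above.

```python
import random
from PIL import Image, ImageOps, ImageEnhance

# Nine color-only operations; none of them moves pixels, so the
# segmentation label map can be left unchanged.
OPS = {
    "AutoContrast": lambda img, m: ImageOps.autocontrast(img),
    "Invert":       lambda img, m: ImageOps.invert(img),
    "Equalize":     lambda img, m: ImageOps.equalize(img),
    "Solarize":     lambda img, m: ImageOps.solarize(img, threshold=int(255 * (1 - m))),
    "Posterize":    lambda img, m: ImageOps.posterize(img, bits=max(1, int(8 * (1 - m)))),
    "Contrast":     lambda img, m: ImageEnhance.Contrast(img).enhance(0.1 + 1.8 * m),
    "Color":        lambda img, m: ImageEnhance.Color(img).enhance(0.1 + 1.8 * m),
    "Brightness":   lambda img, m: ImageEnhance.Brightness(img).enhance(0.1 + 1.8 * m),
    "Sharpness":    lambda img, m: ImageEnhance.Sharpness(img).enhance(0.1 + 1.8 * m),
}

def apply_policy(img, policy):
    """policy: list of (op_name, probability, magnitude in [0, 1]) tuples,
    e.g., as found by the policy search; applied online to the image only."""
    for name, prob, mag in policy:
        if random.random() < prob:
            img = OPS[name](img, mag)
    return img

# hypothetical searched sub-policy and file name, for illustration only
example_policy = [("Equalize", 0.6, 0.0), ("Brightness", 0.4, 0.7)]
augmented = apply_policy(Image.open("city_frame.png").convert("RGB"), example_policy)
```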
As a direct utilization of the data perturbation strategy, we propose to apply the semantic consistency constraint. Intuitively, a well-optimized domain-invariant model should produce predictions invariant to the perturbation of the input image. We leverage this property to train the model as a complementary constraint when there are no ground truth labels for training. Specifically, we take the predictions of the original examples as the target distributions to optimize the perturbed predictions and apply the KL-divergence constraint between these two predictions. Given unlabeled target images $\mathcal{I}_u$, the corresponding perturbed images, obtained by the aforementioned perturbation policies, are denoted as $\mathcal{I}_a$. Then, both $\mathcal{I}_u$ and $\mathcal{I}_a$ are fed into the segmentation model $G$ to produce the segmentation predictions, which are normalized by the Softmax function and denoted by $P_u$ and $P_a$, respectively. Following VAT [37], we define our semantic consistency loss as:
$$\mathcal{L}_{s\_cyc}(G)=D_{KL}\big[P_u(G;x_u)\,\|\,P_a(G;x_a)\big], \qquad (4)$$
where $x_u$ and $x_a$ are drawn from $\mathcal{I}_u$ and $\mathcal{I}_a$, respectively. Gradients are back-propagated through $P_a(G;x_a)$, while $P_u(G;x_u)$ is kept fixed as the target distribution.
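A minimal sketch of Eq. (4) is given below, assuming `G` returns raw logits; averaging the KL term over pixels is our assumption about the normalization.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(G, x_u, x_a):
    """Eq. (4): KL(P_u || P_a) between predictions on the original image x_u
    and its perturbed version x_a. P_u is detached so it serves as a fixed
    target and only the perturbed branch receives gradients."""
    with torch.no_grad():
        p_u = torch.softmax(G(x_u), dim=1)            # (B, C, H, W), fixed target
    log_p_a = torch.log_softmax(G(x_a), dim=1)        # perturbed branch
    c = log_p_a.shape[1]
    # fold spatial positions into the batch so 'batchmean' averages per pixel
    log_p_a = log_p_a.permute(0, 2, 3, 1).reshape(-1, c)
    p_u = p_u.permute(0, 2, 3, 1).reshape(-1, c)
    return F.kl_div(log_p_a, p_u, reduction="batchmean")
```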

3.4 Transferable prototype module

With the aid of the source domain labels and the few-shot labels in the target domain, we can obtain reliable prototypes representative of the corresponding categories. Based on these noiseless prototypes, we propose the transferable prototype module to align both the intermediate features and the final predictions across the two domains. Specifically, a category-level adaptation and a task-level adaptation are proposed. In the category-level adaptation, we directly match the prototypes of the same category obtained from samples in the source and the target domains. As shown in Fig.1, the prototypes of the same category $k$ in the source domain ($c_k$) and the target domain ($\hat{c}_k$) should be close to each other. By means of this, the feature representations of the two domains are implicitly aligned. In the task-level adaptation, we minimize the discrepancy between the predictions of the source and target prototype-based classifiers for unlabeled target samples, which further helps to align the domains through the prediction distributions. We utilize these two components simultaneously to optimize the network $G$.
Fig.1 Source images $\mathcal{I}_s$, labeled target images $\mathcal{I}_t$, unlabeled target images $\mathcal{I}_u$, and perturbed images $\mathcal{I}_a$ are forwarded into the segmentation network $G$. The corresponding latent features are represented by cuboids with different colors. The features obtained from $\mathcal{I}_s$ and $\mathcal{I}_t$ are trained for segmentation, and $\mathcal{I}_s$ and $\mathcal{I}_u$ are used to train the discriminator $D$. Furthermore, $\mathcal{I}_a$ and $\mathcal{I}_u$ construct the semantic consistency constraint, and all the features are used to train the transferable prototypical networks


Transferable prototypes for FSDA We first illustrate how to compute the prototypes and apply them to the prototype-based classifiers. We adapt Prototypical Networks [35] to the scenario of semantic segmentation. Given a pair of an image and its corresponding label $(x, y)$, let $f(G;x)\in\mathbb{R}^{HW\times d}$ denote the feature embedding of the input image $x$ produced by the model $G$, where $d$ is the embedding dimension. The prototype of the $k$th class is computed as:
$$c_k=\frac{1}{N_k}\sum_{i=1}^{HW} f_i(G;x)\,\mathbb{I}[y_i=k], \qquad (5)$$
where $f_i$ denotes the $d$-dimensional embedding of pixel $i$, $y_i$ denotes its label, $\mathbb{I}[\cdot]$ is the indicator function that equals 1 when the condition is true and 0 otherwise, and $N_k=\sum_{i=1}^{HW}\mathbb{I}[y_i=k]$ is the normalization factor.
With the prototypes obtained by Eq. (5), we can utilize them as the classifier weights and further obtain the prototype-based segmentation probabilities:
$$\hat{P}(y_i=k\,|\,x)=\frac{\exp\big(-D(f_i,c_k)\big)}{\sum_{j=1}^{C}\exp\big(-D(f_i,c_j)\big)}, \qquad (6)$$
where $\hat{P}$ represents the prototype-based probabilities, $D(\cdot,\cdot)$ is the Euclidean distance function following [35], which measures the similarity between the per-pixel feature representations and the prototypes, and $C$ is the total number of classes in the dataset.
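The sketch below implements Eqs. (5) and (6) for a single image, assuming the pixel embeddings have been flattened to shape (HW, d); the softmax over negative distances follows the Prototypical Networks formulation [35], and the function names are ours.

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Eq. (5): per-class mean of pixel embeddings.
    features: (HW, d) embeddings f_i(G; x); labels: (HW,) class indices."""
    protos = torch.zeros(num_classes, features.size(1), device=features.device)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():                       # skip classes absent from this image
            protos[k] = features[mask].mean(dim=0)
    return protos

def prototype_probabilities(features, protos):
    """Eq. (6): softmax over negative Euclidean distances to the prototypes,
    following Prototypical Networks [35]; returns (HW, C) probabilities."""
    dists = torch.cdist(features, protos)    # (HW, C) pairwise Euclidean distances
    return torch.softmax(-dists, dim=1)
```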
Category-level adaptation Category shift remains a serious problem in domain adaptation even when the entire distribution of the source domain matches that of the target domain perfectly. Therefore, it is necessary to align category-level distributions. We achieve this goal by minimizing the distances between per-category prototypes of the source and the target domains. In particular, we obtain two kinds of prototypes, based on labeled source images and few-shot target images, respectively:
$$c_k^s=\frac{1}{N_k(y_s)}\sum_{i=1}^{HW} f_i(G;x_s)\,\mathbb{I}[y_i^s=k],\qquad c_k^t=\frac{1}{N_k(y_t)}\sum_{i=1}^{HW} f_i(G;x_t)\,\mathbb{I}[y_i^t=k], \qquad (7)$$
where $N_k(y_s)$ and $N_k(y_t)$ denote the numbers of pixels of category $k$ in the images $x_s$ and $x_t$, respectively.
It should be noted that the prototypes obtained from the few-shot images are biased with respect to the prototypes computed over the entire dataset. To alleviate this problem, we calculate the target prototypes from both the few-shot target images and their data-perturbation-augmented counterparts with labels. To address the tricky case in which the few labeled images contain only a subset of the categories in the source domain, we ignore the categories that are absent from the few-shot target images. In most cases, however, the few-shot target images reasonably contain the major categories and share common objects and spatial layouts. We minimize the distances between prototypes of the same category across domains:
$$\mathcal{L}_{cate}(c_k^s,c_k^t)=\sum_{k=1}^{C}\big\|c_k^s-c_k^t\big\|_2. \qquad (8)$$
Task-level adaptation is further proposed to reduce the domain discrepancy. Given an image, a well-optimized model should produce the same prediction probabilities with both the source and the target prototype-based classifiers. This necessary condition provides a new prototype-based constraint that depends on the task predictions. We implement the constraint by minimizing the KL-divergence between the prediction distributions. Specifically, for an unlabeled target image $x_u$, the predictions obtained from the source and the target prototype classifiers are $\hat{P}^s(G;x_u)$ and $\hat{P}^t(G;x_u)$ according to Eq. (6). Our task-level adaptation is defined as follows:
$$\mathcal{L}_{task}(x_u)=D_{KL}\big[\hat{P}^s(G;x_u)\,\|\,\hat{P}^t(G;x_u)\big]. \qquad (9)$$
Our final loss for the transferable prototype module is:
$$\mathcal{L}_{proto}(G;x_s,x_t,x_u)=\mathcal{L}_{cate}(c_k^s,c_k^t)+\mathcal{L}_{task}(x_u). \qquad (10)$$
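Assuming the prototype helpers from the previous sketch, Eqs. (8)-(10) could be written as follows; representing the absent-class handling as a boolean mask `valid` is our reading of the text, not a detail the paper specifies.

```python
import torch
import torch.nn.functional as F

def category_level_loss(protos_s, protos_t, valid):
    """Eq. (8): L2 distance between same-class prototypes of the two domains.
    valid: boolean mask over classes, True for classes actually present in
    the few-shot target labels (absent classes are ignored)."""
    per_class = (protos_s - protos_t).norm(dim=1)      # (C,) per-class distances
    return per_class[valid].sum()

def task_level_loss(feat_u, protos_s, protos_t):
    """Eq. (9): KL divergence between the source- and target-prototype
    classifiers' predictions on unlabeled target pixel embeddings (HW, d)."""
    p_s = torch.softmax(-torch.cdist(feat_u, protos_s), dim=1)
    log_p_t = torch.log_softmax(-torch.cdist(feat_u, protos_t), dim=1)
    return F.kl_div(log_p_t, p_s, reduction="batchmean")

# Eq. (10): L_proto = category_level_loss(...) + task_level_loss(...)
```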

3.5 Training objective

The overall training objective of our adversarial few-shot domain adaptation for semantic segmentation contains all aforementioned losses:
$$\mathcal{L}=\mathcal{L}_{seg}(G)+\lambda_{adv}\mathcal{L}_{adv}(G,D)+\lambda_{s\_cyc}\mathcal{L}_{s\_cyc}(G)+\lambda_{proto}\mathcal{L}_{proto}(G), \qquad (11)$$
where $\lambda_{adv}$, $\lambda_{s\_cyc}$, and $\lambda_{proto}$ are the weights that balance these losses.
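Putting the pieces together, Eq. (11) is a weighted sum of the four losses. The sketch below uses as defaults the weight values reported later in Section 4.2; the helper name is ours.

```python
def total_loss(l_seg, l_adv, l_s_cyc, l_proto,
               lambda_adv=0.001, lambda_s_cyc=1.0, lambda_proto=0.01):
    """Eq. (11): overall few-shot domain adaptation objective; default
    weights are the values reported in Section 4.2."""
    return l_seg + lambda_adv * l_adv + lambda_s_cyc * l_s_cyc + lambda_proto * l_proto
```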

4 Experiments

4.1 Datasets

We evaluate our method on two popular domain adaptation semantic segmentation tasks: GTA5 [38] → Cityscapes [39] and SYNTHIA [40] → Cityscapes. To the best of our knowledge, these tasks are the standard benchmarks for evaluating domain adaptation semantic segmentation performance. Both GTA5 and SYNTHIA are synthetic datasets generated using computer graphics, so it is convenient and cheap to obtain many samples with precise ground-truth labels. In contrast, the Cityscapes dataset is captured in real-world scenes and is hard to annotate. We apply the synthetic datasets as the source datasets and take the real-world Cityscapes as the target dataset to conduct the adaptation task. GTA5 contains 24,966 images with a resolution of 1914×1024. SYNTHIA consists of 9,400 images with a resolution of 1280×760. Cityscapes is a real-world dataset containing high-quality images of street scenes collected from 18 different cities. It contains 2,975 images for training and 500 images for validation, with a resolution of 2048×1024. We randomly select N samples from each of the 18 cities in Cityscapes for the N-shot domain adaptation task. Consequently, there are 90 labeled images in total in the target set for the 5-shot domain adaptation semantic segmentation. All the remaining training images are taken as unlabeled ones for adaptation.

4.2 Implementation details

We first learn a data perturbation policy for the semantic segmentation problem. Following the algorithm in [13], we search on the GTA5 dataset, and the final perturbation strategy obtained is applied in all the following experiments. To obtain a perturbation strategy compatible with the segmentation task, we only consider color operations in the search space, which do not alter the structure or distribution of the pixel-level labels. Specifically, nine operations are considered, i.e., AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, and Sharpness. These operations are chosen so as not to change the distribution and structure of the segmentation labels. Unlike classification, which only requires global image tags, the segmentation task needs to predict accurate pixel-level labels and relies on positional context; hence, augmentation strategies that significantly change the label structure, such as affine transformation and image rotation, are generally harmful to segmentation. We utilize the DeepLab-v2 [1] framework pre-trained on the ImageNet dataset as our segmentation network, and we follow the hyper-parameters in [13] for training. Briefly, we train the model using the SGD optimizer with a learning rate of $2.5\times10^{-4}$, weight decay of $5\times10^{-4}$, momentum of 0.9, and cosine learning rate decay with one annealing cycle [44].
For our few-shot domain adaptation training, we adopt the DeepLab-v2 framework as the segmentation network $G$ and follow the approach in AdaptSeg [7] as the baseline method. It is noteworthy that the multi-level adversarial strategy is not applied, to save computation memory. To make a fair and comprehensive comparison, we experiment with different backbones, i.e., VGG16 and ResNet-101, to evaluate the proposed approach. For the discriminator network $D$, we construct five convolutional layers with kernel size 4×4, channel numbers {64, 128, 256, 512, 1}, and stride 2. Each layer except the last is followed by a Leaky-ReLU parameterized by 0.2. During training, we use the SGD optimizer for $G$ with a momentum of 0.9 and a weight decay of $10^{-4}$. The initial learning rate is set to $2.5\times10^{-4}$ and decayed by a poly learning rate policy with a power of 0.9. In our experiments, we follow previous work [5] to set $\lambda_{adv}=0.001$, and set $\lambda_{s\_cyc}=1.0$ and $\lambda_{proto}=0.01$ by validation experiments. In addition, we use the Adam optimizer to optimize $D$ with $\beta_1=0.9$ and $\beta_2=0.99$. We initialize its learning rate as $10^{-4}$ and adopt the same poly decay scheduler as for the segmentation network.
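As a concrete reading of the discriminator description above, a PyTorch sketch is given below; padding of 1 and feeding the softmax prediction map as input are our assumptions, not details stated in the paper.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five 4x4 convolutions with stride 2 and channels {64, 128, 256, 512, 1};
    every layer except the last is followed by LeakyReLU(0.2)."""
    def __init__(self, num_classes):
        super().__init__()
        channels = [num_classes, 64, 128, 256, 512, 1]
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                    kernel_size=4, stride=2, padding=1))
            if i < 4:
                layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # x: (B, num_classes, H, W) softmax segmentation prediction map
        return self.model(x)
```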
We use PyTorch [45] for implementation. Performance is evaluated by the Intersection-over-Union (IoU) metric.

4.3 Experimental results

In this section, we present the experimental results on the adaptation tasks GTA5 → Cityscapes and SYNTHIA → Cityscapes, respectively. We evaluate the proposed few-shot domain adaptation method under different settings. We first report source-only results, i.e., models trained on the synthetic dataset GTA5 or SYNTHIA and directly evaluated on the real-world Cityscapes dataset. Furthermore, we compare our method with state-of-the-art UDA methods for semantic segmentation, including CDA [8], Cross-city [41], CyCADA [24], AdaptSeg [7], CLAN [5], and so on. For a fair comparison, we directly report the results from their original papers. We construct the baseline model following the standard UDA method AdaptSeg. Based on the baseline model, we evaluate our proposed semantic consistency constraint ($M_{s\_cyc}$) and transferable prototype module ($M_{proto}$), respectively. Experiments are conducted under the 5-shot setting, and the details of our results are elaborated in the following.
GTA5 to Cityscapes Tab.1 shows the results on the GTA5 → Cityscapes task with comparisons to state-of-the-art methods. From the results, we can conclude that our approach significantly improves the model's adaptation ability with both the VGG16 and the ResNet-101 networks. Specifically, our method outperforms the source-only baseline by 24.2% and 16.6% with VGG16 and ResNet-101, respectively. Meanwhile, compared to state-of-the-art unsupervised domain adaptation methods, our method also brings large improvements of 5.5% and 2.3% points with the VGG16 and ResNet-101 architectures, respectively. In addition, we compare the proposed components to the baseline model in ablation. We observe that both the semantic consistency module and the transferable prototype module are beneficial to the domain adaptation ability: they bring 6.8% and 6.9% improvements with ResNet-101 and 4.4% and 3.9% points with VGG16, respectively. It should be noted that these two components are complementary to each other; applying both of them simultaneously brings 9.2% and 5.7% improvements on ResNet-101 and VGG16, respectively.
SYNTHIA to Cityscapes For the SYNTHIA → Cityscapes task, we consider the 13 categories following [7] for a fair comparison. The results are shown in Tab.2. We observe similar trends as in the GTA5 → Cityscapes experiments. Specifically, our method outperforms the baseline method by 7.7% and 6.9% points with VGG16 and ResNet-101, respectively. Note that our method attains 9.4% and 8.0% gains over CLAN [5] with only 5-shot labeled images. This further demonstrates the generalization ability and effectiveness of our approach.
Tab.1 Results of adaptation from GTA5 to Cityscapes. We first compare with the state-of-the-art UDA algorithms adopting the VGG16 (V) and ResNet-101 (R) networks. Then, we report our results with (s_cyc)/(proto) modules respectively. We highlight the best result in each column in bold
GTA5→Cityscapes
Method Arch. Road Side Build Wall Fence Pole Light Sign Vege Terr Sky Pers Rider Car Truck Bus Train Motor Bike mIoU
Source-only V 26.0 14.9 65.1 5.5 12.9 8.9 6.0 2.5 70.0 2.9 47.0 24.5 0.0 40.0 12.1 1.5 0.0 0.0 0.0 17.9
FCNs [2] V 0.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1
CyCADA [24] V 85.6 30.7 74.7 14.4 13.0 17.6 13.7 5.8 74.6 15.8 69.9 38.2 3.5 72.3 16.0 5.0 0.1 3.6 0.0 29.2
MCD [3] V 86.4 8.5 76.1 18.6 9.7 14.9 7.8 0.6 82.8 32.7 71.4 25.2 1.1 76.3 16.1 17.1 1.4 0.2 0.0 28.8
AdaptSeg [7] V 87.3 29.8 78.6 21.1 18.2 22.5 21.5 11.0 79.7 29.6 71.3 46.8 6.5 80.1 23.0 26.9 0.0 10.6 0.3 35.0
CLAN [5] V 88.0 30.6 79.2 23.4 20.5 26.1 23.0 14.8 81.6 34.5 72.0 45.8 7.9 80.5 26.6 29.9 0.0 10.7 0.0 36.6
Baseline V 93.4 57.6 79.9 23.0 21.3 23.7 15.1 11.7 80.9 37.8 83.5 42.2 9.2 78.4 9.5 0.9 15.4 4.8 3.7 36.4
Ms_cyc V 94.2 62.4 82.5 20.8 30.6 26.9 23.6 22.9 82.3 39.0 87.3 50.5 16.2 79.9 17.7 4.9 11.9 6.6 15.9 40.8
Mproto V 94.4 62.9 82.2 21.4 26.3 27.9 23.8 21.5 84.7 38.5 85.3 51.4 13.9 80.6 14.2 4.1 3.8 3.8 24.0 40.3
Ours (all) V 93.7 58.9 82.7 31.4 28.1 26.8 22.2 22.8 83.5 40.2 86.1 49.0 17.1 78.9 25.4 3.9 20.6 5.8 21.0 42.1
Source-only R 75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 36.0
AdaptSeg [7] R 86.5 25.9 79.8 22.1 20.0 23.6 33.1 21.8 81.8 25.9 75.9 57.3 26.2 76.3 29.8 32.1 7.2 29.5 32.5 41.1
CLAN [5] R 87.0 27.1 79.6 27.3 23.3 28.3 35.5 24.2 83.6 27.4 74.2 58.6 28.0 76.2 33.1 36.7 6.7 31.9 31.4 43.2
MRNet [42] R 89.1 23.9 82.2 19.5 20.1 33.5 42.2 39.1 85.3 33.7 76.4 60.2 33.7 86.0 36.1 43.3 5.9 22.8 30.8 45.5
R-MRNet [43] R 90.4 31.2 85.1 36.9 25.6 37.5 48.8 48.5 85.3 34.8 81.1 64.4 36.8 86.3 34.9 52.2 1.7 29.0 44.6 50.3
Baseline R 93.8 59.4 79.9 21.5 19.9 26.2 22.9 18.9 83.5 40.7 84.7 58.3 25.6 86.1 37.6 39.8 3.7 11.3 10.2 43.4
Ms_cyc R 95.2 67.6 85.0 27.0 30.5 33.0 38.2 47.8 86.6 44.3 85.9 60.3 33.8 86.7 20.6 14.9 24.2 15.7 56.4 50.2
Mproto R 95.2 65.2 85.1 26.4 30.5 34.1 39.1 48.7 86.5 46.4 86.0 62.2 35.2 85.4 8.75 10.4 25.5 24.0 58.4 50.3
Ours (all) R 95.6 68.8 85.6 27.6 35.6 35.4 40.2 45.2 88.3 46.5 87.6 61.3 36.5 86.3 30.8 10.2 32.7 22.4 57.2 52.6
Tab.2 Results of adaptation from SYNTHIA to Cityscapes. We first compare with the state-of-the-art UDA algorithms adopting the VGG16 (V) and ResNet-101 (R) networks. Then we report our results with (s_cyc)/(proto) modules respectively. We highlight the best result in each column in bold
SYNTHIA → Cityscapes
Method Arch. Road Side. Build. Light Sign Vege. Sky Pers. Rider Car Bus Motor Bike mIoU
Source-only V 6.4 17.7 29.7 0.0 7.2 30.3 66.8 51.5 1.5 47.3 3.9 0.1 0.0 20.2
FCNs [2] V 11.5 19.6 30.8 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 22.9
CDA [8] V 65.2 26.1 74.9 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 34.8
Cross-city [41] V 62.7 25.6 78.3 1.2 5.4 81.3 81.0 37.4 6.4 63.5 16.1 1.2 4.6 35.7
AdaptSeg [7] V 78.9 29.2 75.5 0.1 4.8 72.6 76.7 43.4 8.8 71.1 16.0 3.6 8.4 37.6
CLAN [5] V 80.4 30.7 74.7 1.4 8.0 77.1 79.0 46.5 8.9 73.8 18.2 2.2 9.9 39.3
Baseline V 89.8 43.6 73.1 2.3 19.1 79.4 77.5 43.8 7.7 74.8 6.5 0.7 15.2 41.0
Ms_cyc V 93.7 56.2 79.6 5.7 16.3 80.4 85.0 47.8 11.4 78.6 6.6 7.1 22.0 45.4
Mproto V 92.8 54.2 78.7 6.1 12.8 81.1 83.5 47.9 9.6 76.6 3.6 9.5 28.0 44.9
Ours V 94.7 60.7 82.6 5.5 19.7 84.3 85.6 52.9 10.7 80.2 9.1 10.2 36.7 48.7
Source-only R 55.6 23.8 74.6 6.1 12.1 74.8 79.0 55.3 19.1 39.6 23.3 13.7 25.0 38.6
AdaptSeg [7] R 79.2 37.2 78.8 9.9 10.5 78.2 80.5 53.5 19.6 67.0 29.5 21.6 31.3 45.9
CLAN [5] R 81.3 37.0 80.1 16.1 13.7 78.2 81.5 53.4 21.2 73.0 32.9 22.6 30.7 47.8
MRNet [42] R 82.0 36.5 80.4 18.0 13.4 81.1 80.8 61.3 21.7 84.4 32.4 14.8 45.7 50.2
R-MRNet [43] R 87.6 41.9 83.1 31.3 19.9 81.6 80.6 63.0 21.8 86.2 40.7 23.6 53.1 54.9
Baseline R 88.1 42.4 79.9 16.4 21.8 80.0 77.1 57.6 24.6 75.5 20.0 11.2 40.5 48.9
Ms_cyc R 94.5 61.3 83.4 16.9 24.0 84.8 88.2 61.6 21.9 84.1 27.8 7.1 49.4 54.2
Mproto R 93.4 56.9 82.7 7.2 27.6 83.5 86.8 60.6 24.0 82.0 22.0 11.5 46.7 52.7
Ours R 93.4 57.5 83.2 18.3 29.0 83.9 87.3 60.1 30.2 83.6 38.3 11.3 49.3 55.8
Tab.3 Comparison of our approach with semi-supervised approaches using the ResNet-101 backbone on the Cityscapes (CS) dataset
Method N=375
Train on CS 55.1
Hung et al. [9] 57.1
Tarun et al. [10] 55.9
Mittal et al. [11] 59.3
Ours 59.8
Comparison to semi-supervised domain adaptation In addition to demonstrating the superiority of the proposed method over unsupervised domain adaptation, we also compare it with semi-supervised semantic segmentation works. For a fair comparison, we sample a similar amount of labeled target data in Tab.3. For the setting of N=375, which corresponds to using 375 labeled examples from Cityscapes, our approach achieves the best mIoU of 59.8% with ResNet-101, demonstrating its effectiveness over these semi-supervised approaches.

4.4 Ablation studies

Effects of each component In this section, we analyze the effectiveness of the proposed modules, including the data perturbation-based semantic consistency constraint ($M_{s\_cyc}$) and the category-level transferable prototype module ($M_{proto}$). In addition, we also evaluate the effectiveness of the proposed category-level adaptation ($M_{cate}$) and task-level adaptation ($M_{task}$) within the prototype module. Experiments are conducted on the GTA5 → Cityscapes task. We construct the aforementioned baseline model following AdaptSeg [7] under the proposed few-shot domain adaptation setting. As shown in Tab.4, all proposed components bring observable improvements over the baseline. Specifically, the semantic consistency module benefits global adaptation by forming more robust representations with the data perturbation strategy. The prototype module facilitates adaptation on small objects, e.g., traffic lights, traffic signs, and poles, because it helps to learn more discriminative category boundaries. Moreover, from Tab.4 we observe that the task-level adaptation is more powerful than the category-level adaptation, outperforming the baseline model by 5.4% points on ResNet-101, while the category-level adaptation improves it by 2.0%. Nonetheless, adopting both of them simultaneously brings the best result with a 6.9% improvement.
Tab.4 Ablation studies of proposed modules
GTA5→Cityscapes
Method VGG ResNet
Baseline 36.4 43.4
Ms_cyc 40.8 50.2
Mcate 38.3 45.4
Mtask 39.2 48.8
Mproto 40.3 50.3
Ours 42.1 52.6
Category-level transferable prototypical networks analysis We provide a qualitative analysis in this section to demonstrate that the transferable prototype module can effectively align the category-level distributions between the source and the target domains. We first present colored prediction results of randomly sampled images from the test set, as shown in Fig.2. We can observe that our approach yields better segmentation outputs with correct predictions on small objects, e.g., pole and traffic light. We then use the t-SNE approach [46] to visualize the feature distributions obtained by different methods in Fig.3. For clarity, we only show four categories, i.e., building in blue, pole in orange, traffic sign in red, and traffic light in green. Compared to the baseline method, our approach aligns the category-level representations better, and different categories are more separable with larger metric distances.
Fig.2 Visualization of prediction results of different methods


Fig.3 Visualization of the t-SNE results of different methods


Data perturbation-based semantic consistency To further demonstrate the effectiveness of the proposed semantic consistency constraint ($M_{s\_cyc}$), we provide the evaluation results at different training epochs in Fig.4. We compare the baseline method, the proposed consistency module, and the method with both the consistency constraint and the transferable prototype module. The performance of the baseline method without semantic consistency degrades after around epoch 40. When applying $M_{s\_cyc}$, the accuracy of the proposed method continues to increase with a significant improvement. Moreover, our final method follows the same trend and achieves higher performance. We also observe that over longer training (up to 150 epochs), our approach only fluctuates near the best results within a relatively small range of ±0.9%, while the baseline approach fluctuates much more severely, between 21% and 30%. These results indicate that the proposed data perturbation-based semantic consistency constraint can effectively enhance the generalization and robustness of models while overcoming the overfitting problem caused by the few labeled samples.
Fig.4 Evaluation of the performance at different training epochs. The data perturbation-based semantic consistency constraint effectively alleviates the overfitting problem


N-shot analysis We also evaluate N-shot scenarios to provide insight into how the performance changes with different numbers of annotated target images. Specifically, we report the 1-shot, 2-shot, 3-shot, 5-shot, and 10-shot results in Tab.5. It is noteworthy that even in the extreme 1-shot case, our approach achieves 46.9% mIoU with ResNet-101, outperforming many UDA methods such as AdaptSeg [7] and CLAN [5] by a large margin. Moreover, our 10-shot model achieves performance comparable to the fully supervised setting in some categories, such as road, building, and vegetation, with accuracies of 95.6% vs. 96.4%, 83.8% vs. 86.6%, and 85.4% vs. 87.1% based on the VGG16 network. These results demonstrate that our approach is effective in balancing the annotation burden and the model performance and is beneficial to practical domain adaptation applications.
Tab.5 Results of N-shot target samples
GTA5→Cityscapes
Method VGG ResNet
1-shot 36.8 46.9
2-shot 37.6 48.8
3-shot 38.6 49.7
5-shot 42.1 52.6
10-shot 48.6 56.5
Full 58.5 65.1
Data perturbation results The semantic consistency constraint relies on the data perturbation strategy for learning. The data perturbation function helps learn more robust representations by providing more diverse samples in an expanded style space. We inspect the data perturbation results on the Cityscapes dataset in Fig.5 and confirm that it provides more style variation while maintaining the content structure of the original images.
Fig.5 Visualization of data perturbation for Cityscapes. The first row indicates the original images, and the second row indicates the augmented images


5 Conclusions

In this paper, we propose a novel few-shot domain adaptation problem for the semantic segmentation task, which leverages few-shot labeled target domain images for training. To effectively employ the few-shot labels, we propose a data perturbation-based strategy and a transferable prototype module, which help to learn more robust representations and align the cross-domain features. We conducted detailed experiments to analyze the proposed approach and demonstrated its effectiveness in utilizing the few-shot labels for domain adaptation. In particular, we show that the proposed approach is effective in balancing the annotation burden and model performance, demonstrating its value in practical domain adaptation applications.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (2019QY1604), the Major Project for New Generation of AI (2018AAA0100400), the National Youth Talent Support Program, and the National Natural Science Foundation of China (Grant Nos. U21B2042, 62006231, and 62072457).
1
Chen L C , Papandreou G , Kokkinos I , Murphy K , Yuille A L . Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40( 4): 834– 848

2
Hoffman J Wang D Yu F Darrell T. FCNs in the wild: pixel-level adversarial and constraint-based adaptation. 2016, arXiv preprint arXiv: 1612.02649

3
Saito K Watanabe K Ushiku Y Harada T. Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3723– 3732

4
Wang Y Peng J Zhang Z. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 9072– 9081

5
Luo Y Zheng L Guan T Yu J Yang Y. Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2502– 2511

6
Zhang Y Qiu Z Yao T Liu D Mei T. Fully convolutional adaptation networks for semantic segmentation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 6810– 6818

7
Tsai Y H Hung W C Schulter S Sohn K Yang M H Chandraker M. Learning to adapt structured output space for semantic segmentation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7472– 7481

8
Zhang Y David P Gong B. Curriculum domain adaptation for semantic segmentation of urban scenes. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 2039– 2049

9
Hung W C Tsai Y H Liou Y T Lin Y Y Yang M H. Adversarial learning for semi-supervised semantic segmentation. In: Proceedings of British Machine Vision Conference 2018. 2018, 65

10
Kalluri T Varma G Chandraker M Jawahar C V. Universal semi-supervised semantic segmentation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 5258– 5269

11
Mittal S Tatarchenko M Brox T. Semi-supervised semantic segmentation with high- and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(4): 1369– 1379

12
Goodfellow I J Pouget-Abadie J Mirza M Xu B Warde-Farley D Ozair S Courville A Bengio Y. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014, 2672– 2680

13
Lim S Kim I Kim T Kim C Kim S. Fast autoaugment. In: Proceedings of Neural Information Processing Systems 32. 2019, 6662– 6672

14
Liu T Yang Q Tao D. Understanding how feature structure transfers in transfer learning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017, 2365– 2371

15
Ge P Ren C X Dai D Q Yan H. Domain adaptation and image classification via deep conditional adaptation network. 2020, arXiv preprint arXiv: 2006.07776

16
Wittich D , Rottensteiner F . Appearance based deep domain adaptation for the classification of aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 180: 82– 102

17
He Z Zhang L. Multi-adversarial faster-RCNN for unrestricted object detection. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6667– 6676

18
He Z Zhang L. Domain adaptive object detection via asymmetric tri-way faster-RCNN. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 309– 324

19
Song L , Xu Y , Zhang L , Du B , Zhang Q , Wang X . Learning from synthetic images via active pseudo-labeling. IEEE Transactions on Image Processing, 2020, 29: 6452– 6465

20
Gao L Zhang J Zhang L Tao D. DSP: dual soft-paste for unsupervised domain adaptive semantic segmentation. In: Proceedings of the 29th ACM International Conference on Multimedia. 2021, 2825– 2833

21
Pape C , Matskevych A , Wolny A , Hennies J , Mizzon G , Louveaux M , Musser J , Maizel A , Arendt D , Kreshuk A . Leveraging domain knowledge to improve microscopy image segmentation with lifted multicuts. Frontiers in Computer Science, 2019, 1: 6

22
Quan T M , Hildebrand D G C , Jeong W K . Fusionnet: a deep fully residual convolutional neural network for image segmentation in connectomics. Frontiers in Computer Science, 2021, 3: 613981

23
Baniukiewicz P , Lutton E J , Collier S , Bretschneider T . Generative adversarial networks for augmenting training data of microscopic cell images. Frontiers in Computer Science, 2019, 1: 10

24
Hoffman J Tzeng E Park T Zhu J Y Isola P Saenko K Efros A A Darrell T. CyCADA: cycle-consistent adversarial domain adaptation. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 1989– 1998

25
Yao T Pan Y Ngo C W Li H Mei T. Semi-supervised domain adaptation with subspace learning for visual recognition. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 2142– 2150

26
Saito K Kim D Sclaroff S Darrell T Saenko K. Semi-supervised domain adaptation via minimax entropy. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 8049– 8057

27
Zhang H Cisse M Dauphin Y N Lopez-Paz D. Mixup: beyond empirical risk minimization. In: Proceedings of the 6th International Conference on Learning Representations. 2018

28
DeVries T Taylor G W. Improved regularization of convolutional neural networks with cutout. 2017, arXiv preprint arXiv: 1708.04552

29
Yun S Han D Chun S Oh S J Yoo Y Choe J. CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6022– 6031

30
Cubuk E D Zoph B Mane D Vasudevan V Le Q V. Autoaugment: learning augmentation policies from data. 2018, arXiv preprint arXiv: 1805.09501

31
Zoph B Cubuk E D Ghiasi G Lin T Y Shlens J Le Q V. Learning data augmentation strategies for object detection. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 566– 583

32
Zhang L Zhou Y Zhang L. On the robustness of domain adaption to adversarial attacks. 2021, arXiv preprint arXiv: 2108.01807

33
Koch G Zemel R Salakhutdinov R. Siamese neural networks for one-shot image recognition. In: Proceedings of the 32nd International Conference on Machine Learning. 2015

34
Vinyals O Blundell C Lillicrap T Wierstra D. Matching networks for one shot learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 3637– 3645

35
Snell J Swersky K Zemel R. Prototypical networks for few-shot learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 4077– 4087

36
Bergstra J S Bardenet R Bengio Y Kégl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 2546– 2554

37
Miyato T , Maeda S I , Koyama M , Ishii S . Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41( 8): 1979– 1993

38
Richter S R Vineet V Roth S Koltun V. Playing for data: Ground truth from computer games. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 102– 118

39
Cordts M Omran M Ramos S Rehfeld T Enzweiler M Benenson R Franke U Roth S Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213– 3223

40
Ros G Sellart L Materzynska J Vazquez D Lopez A M. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3234– 3243

41
Chen Y H Chen W Y Chen Y T Tsai B C Wang Y C F Sun M. No more discrimination: cross city adaptation of road scene segmenters. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 2011– 2020

42
Zheng Z Yang Y. Unsupervised scene adaptation with memory regularization in vivo. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. 2021, 150

43
Zheng Z , Yang Y . Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, 2021, 129( 4): 1106– 1120

44
Loshchilov I Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2016

45
Paszke A Gross S Chintala S Chanan G Yang E DeVito Z Lin Z Desmaison A Antiga L Lerer A. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017

46
van der Maaten L , Hinton G . Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9( 86): 2579– 2605
