KD-Crowd: a knowledge distillation framework for learning from crowds

Shaoyuan LI, Yuxiang ZHENG, Ye SHI, Shengjun HUANG, Songcan CHEN

Front. Comput. Sci., 2025, 19(1): 191302. DOI: 10.1007/s11704-023-3578-7
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Recently, crowdsourcing has established itself as an efficient labeling solution by distributing tasks to crowd workers. As the workers can make mistakes with diverse expertise, one core learning task is to estimate each worker's expertise and aggregate over their annotations to infer the latent true labels. In this paper, we show that, as one of the major research directions, the noise transition matrix based worker expertise modeling methods commonly overfit the annotation noise, either due to the oversimplified noise assumption or inaccurate estimation. To solve this problem, we propose a knowledge distillation framework (KD-Crowd) by combining the complementary strength of noise-model-free robust learning techniques and transition matrix based worker expertise modeling. The framework consists of two stages: in Stage 1, a noise-model-free robust student model is trained by treating the prediction of a transition matrix based crowdsourcing teacher model as noisy labels, aiming at correcting the teacher’s mistakes and obtaining better true label predictions; in Stage 2, we switch their roles, retraining a better crowdsourcing model using the crowds’ annotations supervised by the refined true label predictions given by Stage 1. Additionally, we propose one f-mutual information gain (MIGf) based knowledge distillation loss, which finds the maximum information intersection between the student’s and teacher’s predictions. We show in experiments that MIGf achieves obvious improvements compared to the regular KL divergence knowledge distillation loss, which tends to force the student to memorize all information of the teacher’s prediction, including its errors. We conduct extensive experiments showing that, as a universal framework, KD-Crowd substantially improves previous crowdsourcing methods on true label prediction and worker expertise estimation.


Keywords

crowdsourcing / label noise / worker expertise / knowledge distillation / robust learning

Cite this article

Shaoyuan LI, Yuxiang ZHENG, Ye SHI, Shengjun HUANG, Songcan CHEN. KD-Crowd: a knowledge distillation framework for learning from crowds. Front. Comput. Sci., 2025, 19(1): 191302 https://doi.org/10.1007/s11704-023-3578-7

1 Introduction

Deep neural networks (DNNs) have become essential for a variety of tasks, but their successes rely heavily on large amounts of high-quality labeled data, which are not available in many applications due to the labeling cost or difficulty. In recent years, with the advent of crowdsourcing services such as Amazon Mechanical Turk, crowdsourcing has been used for collecting data annotations in numerous fields including sentiment classification [1], medical diagnosis [2, 3], vision tagging [4], and image captioning [5].
To alleviate individual labeling errors, crowdsourcing commonly distributes each task to multiple workers and aggregates over their annotations. As the workers can be inherently subjective with various skill levels, modeling each worker’s expertise is intrinsically needed in many scenarios. A major part of crowdsourcing studies is devoted to such aspects. For instance, the seminal Dawid-Skene (DS) model [6] leveraged labeling error rates to quantify workers’ expertise and utilized the expectation maximization (EM) algorithm to optimize the annotation likelihood. Following DS, extensive variants have been proposed by jointly modeling the skills of the workers and inferring the latent true labels [2, 4, 7].
Generally, the worker expertise model used by existing methods can be formulated as a transition matrix, i.e., $T^m \in [0,1]^{C\times C}$, with $T^m_{ij} = P(\bar{Y}^m = j \mid Y = i)$ representing the probability that worker $m$ annotates class label $\bar{Y}^m = j$ to an instance $x$ with true class label $Y = i$. Existing robust learning theory [8] has shown that with unbiased transition matrix estimation, the classifier trained on noisy labels is guaranteed to converge to the optimal classifier defined on the clean labels. Although theoretically grounded, in practice the performance of existing crowdsourcing methods heavily relies on successfully estimating the workers’ expertise. On the one hand, for identifiability, $T^m$ is often assumed to be class-dependent, which is oversimplified for complex noise patterns; on the other hand, the estimates of the unknown worker expertise and the latent true labels interact with each other, so estimation errors occur and propagate easily. We will show that existing methods easily overfit the part of the annotation noise that cannot be captured by the learned transition matrix, resulting in degraded performance.
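To make the transition matrix noise model concrete, the following is a minimal sketch (not the authors' code); the class count and noise level are assumptions for illustration:

```python
import numpy as np

# Illustrative sketch of a class-dependent noise transition matrix T^m
# (3 classes and a 20% symmetric noise level are assumed for illustration).
C = 3
noise = 0.2
T_m = np.full((C, C), noise / (C - 1))
np.fill_diagonal(T_m, 1.0 - noise)   # T_m[i, j] = P(worker m annotates j | true label i)

rng = np.random.default_rng(0)

def annotate(true_label: int) -> int:
    """Sample worker m's annotation for an instance with the given true label."""
    return int(rng.choice(C, p=T_m[true_label]))

print(T_m.sum(axis=1))   # each row is a distribution over the annotated classes
print(annotate(1))       # mostly returns 1, sometimes a wrong class
```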
Fig.1 shows an illustrative example. We record the dynamic learning process of one representative crowdsourcing approach, Max-MIG [7], following its original experiment setting on CIFAR-10 with 60% symmetric noise and independent mistakes. For each worker, we compare the worker model’s true label predictions for the instances with the worker’s annotations. We use p to denote the fraction of wrong annotations that are memorized for each worker, i.e., the true label predictions that are equal to the wrong annotations. In Fig.1, we plot the mean p values of all workers, while the shaded regions encapsulate the range within one standard deviation of the respective means. From Fig.1, it can be seen that the worker models’ true label predictions gradually memorize the wrong annotations. Actually, such an overfitting phenomenon is common for existing crowdsourcing methods; see the experimental results in Section 4.2. However, this issue is rarely noticed and studied. In this paper, we propose to leverage the recent advances in noise-model-free robust learning techniques and provide a general framework to effectively boost current approaches.
Fig.1 Illustrative example for dynamic memorization of annotation noise. The experiment is conducted on CIFAR-10 with 60% symmetric noise and independent mistakes. The vertical axis shows the fraction of wrong annotations that are memorized by each worker’s true label predictions. The shaded regions for the curves encapsulate the range within one standard deviation of the respective means


Recently, noise-model-free robust learning techniques have achieved excellent success in label noise learning. Among them, one major type is built on the memorization effect of DNNs [9], i.e., during the learning process, the DNNs first fit the clean labels, then gradually overfit the wrong labels. Inspired by this phenomenon, tricks including early stopping and small-loss selection are exploited to combat label noise, i.e., noisy labels are identified by ending the training early, and the small-loss and large-loss instances can then be respectively regarded as clean and noisy instances [10]. In combination with other techniques such as pseudo-labeling, multi-networks, data augmentation and self-supervised learning, numerous approaches have been proposed with promising results [11–17].
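As a concrete illustration of the small-loss trick (our own sketch, not code from the cited works; the keep ratio is an assumed hyperparameter), a per-batch selection step could look like the following, where only the selected samples would be back-propagated:

```python
import torch
import torch.nn.functional as F

# Sketch of the small-loss trick: treat the samples with the smallest per-sample
# loss in a batch as (probably) clean and update the model only on them.
def small_loss_mask(logits: torch.Tensor, noisy_labels: torch.Tensor,
                    keep_ratio: float = 0.8) -> torch.Tensor:
    """Return a boolean mask marking the small-loss (likely clean) samples."""
    per_sample_loss = F.cross_entropy(logits, noisy_labels, reduction="none")
    num_keep = max(1, int(keep_ratio * per_sample_loss.numel()))
    keep_idx = torch.argsort(per_sample_loss)[:num_keep]   # smallest losses first
    mask = torch.zeros_like(noisy_labels, dtype=torch.bool)
    mask[keep_idx] = True
    return mask

# Usage: loss = F.cross_entropy(logits[mask], noisy_labels[mask])
```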
Despite the advances of current noise-model-free robust learning techniques, they are rarely leveraged in the crowdsourcing field to help worker expertise and true label estimation. In this paper, without a sophisticated redesign of models, we propose a knowledge distillation framework named KD-Crowd, which incorporates the noise-model-free robust techniques to refine crowdsourcing learning. The key idea of KD-Crowd is to combine the complementary strengths of two types of noisy label learning methods. The crowdsourcing methods, e.g., Max-MIG [7], deal with the annotation noise mainly based on the transition matrix noise model to formulate workers’ expertise. They excel at capturing patterned noise but miss relatively random noise and overfit it. The noise-model-free methods, mainly based on the memorization effect of DNNs, are superior at filtering such random noise. Thus they complement the crowdsourcing models and are expected to correct their mistakes.
Specifically, the framework is implemented in two stages. In Stage 1, we first train the transition matrix based crowdsourcing model with early stop before overfitting as a teacher, then train a noise-model-free robust student model mimicking the teacher’s true label predictions, aiming at refining the mistakes that are not captured by the transition matrix noise model. In Stage 2, we switch their roles, retraining a better crowdsourcing model using the crowds’ annotations supervised by the refined true label predictions. As the regular KL divergence distillation loss encourages the student to memorize all information of the teacher’s prediction, it’s too strong for our scenario where the teacher can make serious mistakes. To alleviate this, we propose one f-mutual information gain (MIGf) distillation loss, which finds the maximum information intersection between the student’s and teacher’s prediction.
Exploiting the recent ELR+ [16] algorithm as the noise-model-free method, we apply KD-Crowd to Max-MIG [7], showing in Fig.1 that it greatly reduces the memorization of wrong annotations for the workers. Extensive experiments on synthetic and real-world crowdsourcing data show that KD-Crowd improves Max-MIG by 0.8% to 18.15% absolute test accuracy, and significantly outperforms the other methods. Actually, as a general framework, KD-Crowd is applicable to various crowdsourcing methods for consistent performance gains. At the end of the experiments, we give results for implementing KD-Crowd by using another noise-model-free method, DivideMix [15], over the crowdsourcing models Max-MIG [7] and CrowdLayer [4].
In summary, the contribution of this paper is as follows:
● We are possibly the first to show that existing crowdsourcing methods easily overfit the annotation noise.
● Without sophisticated redesign, we propose a knowledge distillation based framework KD-Crowd, which leverages noise-model-free learning techniques to refine crowdsourcing learning. Besides, we propose one f-mutual information gain (MIGf) based knowledge distillation loss to prevent the student from memorizing serious mistakes of the teacher model.
● We validate the effectiveness and generality of KD-Crowd over two different crowdsourcing models on synthetic and real-world data, showing that KD-Crowd achieves significant performance gains.
In the following we review related work in Section 2, then propose our approach in Section 3, and report the experimental results in Section 4. Finally, we conclude the paper.

2 Related work

To infer the latent true labels from crowds’ annotations, one naive approach is Majority Voting (MV), which however ignores workers’ expertise differences. To counter this issue, the DS model [6] used the labeling error rate of each worker’s annotations to parameterize their expertise and proposed an EM algorithm to iteratively estimate the error rates and latent true labels. Serving as the basis for a large number of models, DS was extended from different perspectives including efficient inference [18], flexible worker expertise modeling [19], and classifier learning [2, 20].
Recently, with the success of DNNs that allow for representation learning of data, deep crowdsourcing learning has been studied to build a DNN classifier from the noisy annotations and set the new state of the art [3, 4, 7, 21–26]. AggNet [3] extended DS [6] by using a convolutional neural network (CNN) classifier as the true label prior and a confusion matrix for each worker’s expertise modeling, and followed the EM optimization procedure. To avoid the computational overhead of the iterative EM procedure, works on novel network architectures have been proposed with efficient end-to-end optimization. Weighted Doctor Net (WDN) [21] constructed one network classifier for each worker with a shared feature extraction sub-network, and learned all network parameters by minimizing the annotation prediction errors in an end-to-end manner. The worker’s expertise is then learned by minimizing the loss between the weighted average of the classifiers’ predictions and the mean average of the observed annotations. CrowdLayer [4], Anno-Reg [22], and CoNAL [23] appended a layer of coefficients for each worker after the output layer of the DNN classifier to characterize the annotator expertise, with the final output layer corresponding to the crowdsourcing annotation prediction. All parameters in the network are optimized end-to-end by minimizing the annotation prediction loss. Max-MIG [7] proposed one information-theoretic loss to maximize the information intersection between the DNN classifier’s prediction and the weighted combination of the annotations.
The worker expertise modeling used by existing methods can be generally formulated as a noise transition matrix, denoting the transition relationship from clean labels to noisy labels. While these models have been shown to be rather effective, we find in this paper that they easily overfit the annotation noise. The reasons include but are not limited to: 1) for more complex noise patterns, e.g., instance-dependent noise, the transition matrix assumption is oversimplified; 2) estimating the transition matrix relies on the noisy class and latent true label posteriors, whose errors interact with each other and occur commonly, leading to degraded learning performance. There have been works exploring this issue through new noise structure design [27] and novel transition matrix estimation approaches [28]. In this paper, without a sophisticated redesign of the noise model, we propose a general framework for improving existing methods. We incorporate the noise-model-free robust techniques through a knowledge distillation framework to refine the crowdsourcing models.
Without modeling the underlying noise transition pattern, noise-model-free robust methods can be divided into two categories: proposing noise-tolerant losses with statistical consistency guarantees [29–31], and designing heuristics to identify noisy data based on the dynamic process of optimization. The latter type mainly relates to the memorization effect of learning models [9, 16], i.e., they tend to memorize and fit easy (clean) patterns first, and then gradually overfit hard (noisy) patterns. Inspired by it, works trade off the overfitting/underfitting of noise using early stopping [32] and small-loss tricks [12], which respectively avoid overfitting noisy labels to some degree by ending training early, and treat the small-loss instances as clean instances and only back-propagate them to update the model parameters. However, such memorization effects and techniques are rarely leveraged in the crowdsourcing field. In this paper, we exploit one state-of-the-art noise-model-free robust method, ELR+ [16], which improves the classifier network’s robustness by maintaining moving-average true label predictions (targets) and designing a regularization term to encourage the network to fit them.
Knowledge distillation (KD) is a general framework to transfer knowledge across different models [33]. In knowledge distillation from ensembles, REFNE, proposed in [34], augments the comprehensibility of trained neural network ensembles by generating instances from these ensembles and extracting symbolic rules; [35] introduces a strategy that uses a network ensemble to form a novel training set by replacing initial class labels with those produced by the ensemble and including additional examples generated from the ensemble. In the pursuit of understanding the soft-target regularizer, [36] advocates for the promotion of diversity as a regularization technique in ensemble methods, culminating in the application of explicit diversity regularization to ensemble pruning. In noisy label learning, [37] proposed to train a teacher network with a small amount of trusted data, and distill a more robust student model over the noisy labels; [38] extended it using more tricks including instance reweighting and filtering, data augmentation and label refinement. These methods rely on clean trusted labels to train the teacher model, which is different from our approach.
Model reuse, an approach that leverages existing models—typically trained for different tasks—aims to construct a new model without starting from scratch. A more sophisticated scheme, Fixed Model Reuse (FMR), is proposed in [39]. FMR exploits the learning capabilities of deep neural models to implicitly glean useful discriminative information from fixed models or features extensively employed across various tasks.

3 The proposed approach

3.1 Preliminaries

We use $X = \{x_1, x_2, \ldots, x_N\}$ to denote the set of $N$ training instances, where $x_i$ denotes the $d$-dimensional feature vector of the $i$th instance. $\mathcal{Y} = \{1, 2, \ldots, C\}$ denotes the label space. The annotations collected from $M$ workers are defined as $\bar{Y} \in \{\mathcal{Y}, 0\}^{N \times M}$, where $\bar{y}_{i,m} = c \in \mathcal{Y}$ means that the $i$th instance is categorized as the $c$th class by the $m$th worker, and $\bar{y}_{i,m} = 0$ indicates that the $m$th worker did not annotate the $i$th instance. The unobserved true labels are defined as $Y = \{y_i\}_{i=1}^{N}$. Our goal is to learn a classifier $h: \mathcal{X} \to \mathcal{Y}$ that generalizes well for unseen test data given the noisy dataset $\{X, \bar{Y}\}$.
Next, we first briefly introduce one representative crowdsourcing model Max-MIG [7], then give the implementation details of KD-Crowd.

3.2 Background

Max-MIG [7] is a deep crowdsourcing method that seamlessly combines DNN representation learning with worker expertise modeling. Assuming that the annotations and instance features are conditionally independent given the true labels, with their information intersection indicating the true labels, Max-MIG jointly trains the DNN classifier and annotation aggregator by maximizing their f-mutual information gain. Specifically, it consists of two components: a DNN classifier $h(\cdot;\theta)$ with parameters $\theta$, and an annotation aggregator $g(\cdot; W^{[M]}, b)$ parameterized by $W^{[M]} = \{W^m \in \mathbb{R}^{C \times C}\}_{m=1}^{M}$ and a bias $b$. For each instance $x$ and its crowds’ annotations $\bar{y}^{[M]}$, Max-MIG inputs $x$ into $h$ and $\bar{y}^{[M]}$ into $g$, and trains them in an end-to-end manner by maximizing the f-mutual information gain MIGf of their outputs, i.e., the difference between the average agreement between $h$ and $g$ on the same tasks and their average agreement on different tasks:
$$\mathrm{MIG}^f\big(h(X), g(\bar{Y})\big) = \frac{1}{N}\sum_{i} \partial f\!\left(\sum_{c=1}^{C} \frac{h(x_i)_c\, g(\bar{y}_i^{[M]})_c}{p_c}\right) - \frac{1}{N(N-1)}\sum_{i \neq j} f^{\star}\!\left(\partial f\!\left(\sum_{c=1}^{C} \frac{h(x_i)_c\, g(\bar{y}_j^{[M]})_c}{p_c}\right)\right).$$
Here $p$ is a hyperparameter denoting the prior distribution of the true labels, set as the uniform distribution in Max-MIG. $f$ is a convex divergence function satisfying $f(1)=0$, with $f^{\star}$ its Fenchel dual. Max-MIG sets $f$ as the KL divergence, which induces $f(t)$, $\partial f(K)$, and $f^{\star}(\partial f(K))$ as:
$$f(t) = t\log t; \qquad \partial f(K) = 1 + \log K; \qquad f^{\star}(\partial f(K)) = K.$$
With S denoting the softmax function, the annotation aggregator g is implemented as:
$$g(\bar{y}^{[M]}; \{W^m\}, b) = S\!\left(\sum_{m=1}^{M} W^m e(\bar{y}^m) + b\right).$$
Here $e(\bar{y}^m)$ denotes the one-hot representation of $\bar{y}^m \in \bar{y}^{[M]}$. When $\bar{y}^m = 0$, meaning a missing annotation, $e(\bar{y}^m)$ is a zero vector. $W^m$ plays the role of an inverse noise transition matrix, i.e., $W^m_{ij}$ represents the probability of the true label being $j$ for an instance when worker $m$ gives annotation class label $i$. For ease of optimization, $W^m$ is relaxed to be a real-valued matrix without structural constraints, and the true label probability is obtained by applying the softmax operation to the aggregation over all workers plus a bias vector $b$. Such relaxations are commonly used in deep crowdsourcing for computational efficiency, see [4, 23]. The bias term $b$ is used to capture the potential shared additive labeling bias among workers.
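To make the above concrete, the following is a minimal PyTorch sketch of the aggregator g and the KL-induced MIGf objective as we read the formulas; it is not the authors' implementation, and the identity initialization of the W^m matrices and the tensor layout are our own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aggregator(nn.Module):
    """g(ybar; {W^m}, b) = softmax(sum_m W^m e(ybar^m) + b); identity init is an assumption."""
    def __init__(self, n_workers: int, n_classes: int):
        super().__init__()
        self.W = nn.Parameter(torch.eye(n_classes).repeat(n_workers, 1, 1))  # M x C x C
        self.b = nn.Parameter(torch.zeros(n_classes))

    def forward(self, one_hot_annos: torch.Tensor) -> torch.Tensor:
        # one_hot_annos: N x M x C; all-zero rows encode missing annotations.
        logits = torch.einsum("nmc,mdc->nd", one_hot_annos, self.W) + self.b
        return F.softmax(logits, dim=1)

def mig_f(h_probs: torch.Tensor, g_probs: torch.Tensor,
          prior: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """MIG^f for f = KL: agreement on matched pairs minus the Fenchel term on mismatched pairs."""
    n = h_probs.size(0)
    inner = (h_probs * g_probs / prior).sum(dim=1)        # matched pairs (i, i)
    agree = (1.0 + torch.log(inner + eps)).mean()         # partial f(K) = 1 + log K
    cross = h_probs @ (g_probs / prior).t()               # all pairs (i, j)
    cross = cross - torch.diag(torch.diag(cross))         # drop the i = j terms
    disagree = cross.sum() / (n * (n - 1))                # f*(partial f(K)) = K
    return agree - disagree                               # a gain: larger is better
```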
As we have discussed in the introduction, the worker expertise modeling used by Max-MIG inevitably suffers from insufficiency and estimation error. Next, we propose the KD-Crowd framework as a general and effective improved solution.

3.3 KD-Crowd: a knowledge distillation framework for crowdsourcing

Fig.2 shows the overall framework of KD-Crowd. Specifically, KD-Crowd consists of two stages: (1) true label prediction refinement; (2) joint distillation and crowdsourcing retraining. In the following, we explain the details.
Fig.2 The KD-Crowd framework consists of two stages: (1) We train a noise-model-free robust student model to refine the true label predictions of the transition matrix based crowdsourcing teacher model; (2) We retrain the crowdsourcing model using crowds’ annotations and the refined true label predictions simultaneously


True label predictions refinement In this stage, we first pretrain the crowdsourcing model (e.g., Max-MIG in this paper) with early stopping as the teacher model, before it overfits the annotation noise. Then we train a complementary robust student model over the one-hot label predictions of the teacher model as a refiner. Here we employ ELR+ (Early Learning Regularization) [16], which improves classifier robustness by maintaining moving-average true label predictions (targets) formed before overfitting sets in.
Formally, using $D = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$ to denote the noisy label dataset, the ELR+ loss function is defined as:
$$\mathcal{L}_{\mathrm{ELR}} := \mathcal{L}_{\mathrm{CE}} + \lambda \frac{1}{N}\sum_{i=1}^{N} \log\big(1 - \langle h(x_i), t_i \rangle\big).$$
Here $\mathcal{L}_{\mathrm{CE}}$ represents the cross entropy loss over $D$, $\lambda$ is the coefficient of the regularization term, and $\langle h(x_i), t_i \rangle$ denotes the inner product between the classifier’s softmax label prediction $h(x_i)$ and the target $t_i$. At iteration $k$ of training, the target $t_i^{(k)}$ for instance $i$ is defined as:
$$t_i^{(k)} := \beta\, t_i^{(k-1)} + (1-\beta)\, h^{(k)}(x_i),$$
where $\beta \in [0,1)$ is the momentum of the moving average. ELR+ alternates between computing the targets and minimizing the cost function in Eq. (4). To improve the target estimation, ELR+ adopts a co-teaching paradigm with two separate network classifiers $h_1, h_2$: each network mimics the output of the other, so that the error of one network is not fed back to itself and the confirmation bias easily introduced by a single network is alleviated. Besides, it also incorporates data augmentation and semi-supervised techniques to achieve better performance. Through the regularization term, ELR+ boosts the gradients of instances with clean labels and neutralizes the gradients of instances with false labels.
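A minimal sketch of the ELR regularizer and its momentum targets, following Eq. (4) and the target update above (the λ and β values are illustrative, not the authors' exact settings):

```python
import torch
import torch.nn.functional as F

class ELRLoss:
    """Sketch of the early-learning regularization of Eq. (4) with momentum targets."""
    def __init__(self, n_samples: int, n_classes: int,
                 lam: float = 3.0, beta: float = 0.7):
        self.targets = torch.zeros(n_samples, n_classes)   # running targets t_i
        self.lam, self.beta = lam, beta

    def __call__(self, logits: torch.Tensor, noisy_labels: torch.Tensor,
                 index: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(logits, dim=1)
        # t_i^(k) = beta * t_i^(k-1) + (1 - beta) * h^(k)(x_i), no gradient through t_i
        self.targets[index] = (self.beta * self.targets[index]
                               + (1.0 - self.beta) * probs.detach())
        ce = F.cross_entropy(logits, noisy_labels)
        inner = (probs * self.targets[index]).sum(dim=1)    # <h(x_i), t_i>
        reg = torch.log(1.0 - inner + 1e-8).mean()
        return ce + self.lam * reg
```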
Joint distillation and crowdsourcing retraining In this stage, we switch the roles of teacher and student. We freeze the ELR+ classifiers $h_1, h_2$ trained in the first stage as the teacher model, and calculate the refined true label prediction $\hat{Y}$ as the softmax of the temperature-scaled combination of $h_1$ and $h_2$:
$$\hat{Y} = \{\hat{y}_i\}_{i=1}^{N}, \qquad \hat{y}_i = S\big(h_1(x_i)/\tau + h_2(x_i)/\tau\big).$$
Here $S$ is the softmax function and $\tau$ is the temperature parameter. Then we retrain the crowdsourcing model from the annotations $\bar{Y}$, aided with the distilled knowledge $\hat{Y}$, expecting to further filter out the new mistakes introduced by the student in Stage 1. Formally, the training loss is defined as:
$$\mathcal{L} = -\,\mathrm{MIG}^f\big(h(X), g(\bar{Y})\big) - \lambda\, \mathrm{MIG}^f\!\left(\frac{h(X) + h(X')}{2},\, \hat{Y}\right),$$
where $X'$ denotes the augmented version of $X$ introduced below.
Here h and g are the DNN classifier and annotation aggregator of the Max-MIG model. The first term is the regular Max-MIG loss over the crowds’ annotations. The second term is the knowledge distillation loss defined for the classifier h over the distilled knowledge Y^. Note that instead of the regular KL divergence which enforces the classifier to memorize all information of Y^, including its errors, we propose to use the f-mutual information gain measure MIGf of Max-MIG as the knowledge distillation loss in this paper. As elucidated in Theorem E.3 and Theorem 3.4 of Max-MIG [7], in situations wherein a high degree of correlation in labeling errors occurs, implementing KL divergence as the learning mechanism could inadvertently lead to an overemphasis on memorizing the correlated errors, subsequently resulting in erroneous predictions. Contrastingly, MIGf, by identifying the maximum information intersection between the classifier’s prediction and Y^, exhibits a notable capacity to effectively manage these scenarios, manifesting its capability by assigning a null weight to all non-contributory labels, thus ensuring the preservation of the accuracy of predictions. Empirical evidence from our conducted ablation experiments also demonstrates that MIGf attains considerable advancements compared to the KL divergence approach.
Inspired by [40], we additionally conduct RandAugment [41] over the input instances. RandAugment is an automated data augmentation method that significantly reduces the search space and allows training on target tasks with no need for a separate proxy task. We sequentially conduct random crop and horizontal flip on $X$ and obtain the augmented version $X'$ as an additional input of $h$ to improve the crowdsourcing model’s generalization performance. After the whole training is completed, we can use either single student model of Stage 1 or 2, or their ensemble, to predict labels for unseen test instances.
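Putting the pieces together, one Stage 2 update could be sketched as follows, reusing mig_f and Aggregator from the Max-MIG sketch in Section 3.2; y_hat denotes the frozen refined predictions from Stage 1 and aug stands for the RandAugment-style transform (all names here are our own, not the authors' code):

```python
def stage2_step(classifier, aggregator, optimizer, x, one_hot_annos, y_hat,
                prior, aug, lam: float = 1.0):
    """One Stage 2 update: maximize MIG^f w.r.t. the crowds' annotations and w.r.t.
    the frozen refined predictions y_hat, using an augmented view of the inputs."""
    x_aug = aug(x)                                   # disturbed version X'
    h_x, h_aug = classifier(x), classifier(x_aug)    # softmax predictions of h
    g_y = aggregator(one_hot_annos)                  # aggregated annotation prediction
    crowd_term = mig_f(h_x, g_y, prior)              # regular Max-MIG objective
    distill_term = mig_f((h_x + h_aug) / 2, y_hat, prior)
    loss = -crowd_term - lam * distill_term          # minimize the negated gains
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```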
Algorithm 1 shows the whole pipeline. In Stage 1, ELR+ is the student model, refining the teacher model Max-MIG’s predictions. In Stage 2, their roles switch: Max-MIG becomes the student model and distills knowledge from the ELR+ model learned in Stage 1. Thus in Stage 1 (2), steps 2–4 (2–5) obtain predictions from the teacher Max-MIG (ELR+), and step 5 (6) trains the student model ELR+ (Max-MIG).
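Since Algorithm 1 is only summarized here, the following sketch outlines the overall flow; the three training callables and the predict/soft_predict interfaces are hypothetical placeholders for the routines described above, not the authors' code:

```python
def kd_crowd_pipeline(pretrain_maxmig, train_elr_plus, retrain_maxmig_with_kd,
                      X, crowd_annotations):
    """Outline of the two-stage pipeline; the callables are supplied by the caller."""
    teacher = pretrain_maxmig(X, crowd_annotations)            # early-stopped pretraining
    pseudo_labels = teacher.predict(X).argmax(dim=1)           # teacher's hard predictions
    student = train_elr_plus(X, pseudo_labels)                 # Stage 1: refine predictions
    y_hat = student.soft_predict(X)                            # refined soft targets
    crowd_model = retrain_maxmig_with_kd(X, crowd_annotations, y_hat)  # Stage 2
    return student, crowd_model                                # use either model or their ensemble
```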


4 Experiments

4.1 Settings

Synthesized datasets We follow Max-MIG [7] for synthetic experiments on CIFAR-10 [42]. To simulate the workers’ mistakes, we corrupt the training labels according to three types of worker-specific transition matrices: instance-dependent noise, symmetric noise, and asymmetric noise. Moreover, as workers commonly annotate only a subset of instances in real-world applications, we randomly erase 50% of each worker’s annotations, and obtain three data groups:
Inst 20%: five workers provide annotations with 20% instance-dependent noise, following the method of [43]: images are broken down into parts, and the noise transition matrix of an instance is estimated by combining the transition matrices of its various parts.
Sym 20% / 40% / 60% / 80%: five workers provide annotations with symmetric noise ratios, which are generated by flipping the class of true labels into the other classes uniformly.
Asym: five workers provide annotations with asymmetric noise, which are generated by flipping the class of true labels into one single manually designated class. Specifically, we flip the classes between pairs including cat/dog, deer/horse, airplane/bird, automobile/truck and frog/ship. All of them are difficult for human workers to distinguish, and the five workers give different annotations. The workers’ accuracies are 50.4%, 49.8%, 69.9%, 79.9%, and 60.4%, respectively.
To simulate the correlations among workers, each group includes three cases (a simulation sketch follows this list):
Independent mistakes: all the workers are mutually independent.
Effortless workers: two extra low-quality workers annotate instances with the first class label effortlessly. It will induce biased annotations when the other workers’ expertise is mutually different.
Correlated mistakes: two extra workers copy two existing workers’ entire annotations. It breaks the classic worker-independent assumption and induces worker-related mistakes.
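The simulation of these groups and cases can be sketched as follows (a minimal, hedged reconstruction with our own helper names; only the symmetric-noise group is shown):

```python
import numpy as np

# Sketch of the synthetic annotation setup. Labels are 1..C and 0 marks a
# missing annotation, matching the notation in Section 3.1.
rng = np.random.default_rng(0)
C, N, M = 10, 50000, 5

def symmetric_annotations(y_true, noise):
    """Flip each true label to a uniformly chosen other class with probability `noise`."""
    flip = rng.random(len(y_true)) < noise
    offset = rng.integers(1, C, len(y_true))
    return np.where(flip, (y_true - 1 + offset) % C + 1, y_true)

y_true = rng.integers(1, C + 1, N)                  # true labels in 1..C
annos = np.stack([symmetric_annotations(y_true, 0.6) for _ in range(M)], axis=1)

# Effortless workers: two extra workers always annotate the first class.
effortless_case = np.concatenate([annos, np.ones((N, 2), dtype=int)], axis=1)
# Correlated mistakes: two extra workers copy two existing workers' annotations.
correlated_case = np.concatenate([annos, annos[:, :2].copy()], axis=1)

def erase_half(a):
    """Randomly erase 50% of the annotations (0 = missing)."""
    return np.where(rng.random(a.shape) < 0.5, 0, a)

annotations = erase_half(annos)   # independent-mistakes case with 50% missing labels
```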
Real-world datasets We conduct experiments on two common real-world crowdsourced datasets: LabelMe [4] and CIFAR10-H [44]. LabelMe contains 1,188 training, 500 validation, and 1,000 test 256×256 color images across 8 classes. It has 2,547 crowd annotations from 59 workers whose mean accuracy is 69.2%. CIFAR10-H is a crowdsourced version of the CIFAR-10 test set. We select the 60 worst workers, whose mean accuracy is 70.4%, and obtain 6,955 images with 12,000 annotations for training. Because it lacks a test set, we randomly selected 10,000 class-balanced images from the CIFAR-10 training set for testing.
Comparison methods Following the setting of [7] and considering recent developments of deep crowdsourcing learning, we take six methods for comparison:
DL-MV, which learns a DNN with data labels aggregated by majority voting over the crowds’ annotations.
WDN [21], which learns a DNN classifier for each worker and aggregates their outputs by weighted majority voting.
CrowdLayer [4], which appends a layer of coefficients for each worker after the DNN classifier output layer to model the worker expertise.
AggNet [3], which learns a CNN classifier as the latent true label prior and conducts EM optimization.
CoNAL [23], which decomposes the annotation noise into individual and common parts to model the workers’ related mistakes.
Max-MIG [7], which learns a DNN classifier and an annotation aggregator by maximizing their f-mutual information gain.
Implementations For fair comparison, all methods on the same dataset use the same backbone network architecture and initial weights. For the synthesized datasets, we use ResNet-50 [45] as the backbone, trained for 50 epochs for the baselines; the epochs of the pretraining and Stages 1 and 2 of KD-Crowd are 10, 30, and 10, respectively. For the real-world datasets, we use ShuffleNet V2 [46] as the backbone on LabelMe, trained for 75 epochs for the baselines, with the pretraining and Stage 1, 2 epochs of KD-Crowd set to 5, 60, and 10; on CIFAR10-H we use ResNet-50, trained for 100 epochs for the baselines, with the KD-Crowd epochs set to 20, 60, and 20. Training the models in KD-Crowd with far fewer epochs than the baselines is reasonable: first, early stopping is a common practice to avoid overfitting in label noise learning; second, we did not carefully select the epochs at which the model performs best, but compensated for possible underfitting through our two-stage refinement procedure. Moreover, the best yet still inferior performance of the baselines shown in Tab.1 and Tab.2 and the ablation studies shown in Tab.3 both confirm that the superiority of KD-Crowd comes from the knowledge distillation and refinement framework, not from over-tuning of each component.
Tab.1 Test accuracy (%) on CIFAR-10 under Inst 20%, Sym 20%, Sym 40%, Sym 60%, Sym 80% and Asym noise cases. For the baselines, we report both the best and last performance. For the proposed KD-Crowd, we report three results for better understanding. Bold values represent the best three methods, bold and underlined values represent the best methods
Noise scenarios Inst 20% Sym 20% Sym 40%
Methods Independent Mistakes Effortless Workers Correlated Mistakes Independent Mistakes Effortless Workers Correlated Mistakes Independent Mistakes Effortless Workers Correlated Mistakes
DL-MV Best 72.45 ± 0.61 54.08 ± 5.83 64.79 ± 4.24 74.14 ± 0.12 10.00 ± 0.00 73.72 ± 0.44 69.65 ± 0.45 10.00 ± 0.00 69.46 ± 0.13
Last 71.65 ± 0.63 52.52 ± 6.13 63.69 ± 4.24 73.40 ± 0.47 10.00 ± 0.00 72.60 ± 1.28 69.19 ± 0.35 10.00 ± 0.00 67.69 ± 0.32
WDN Best 68.73 ± 1.48 10.60 ± 0.68 68.47 ± 1.86 74.16 ± 0.52 10.30 ± 0.17 73.10 ± 0.58 69.39 ± 0.65 10.01 ± 0.07 67.39 ± 2.29
Last 68.25 ± 1.48 10.47 ± 0.58 68.15 ± 1.77 73.77 ± 0.62 10.24 ± 0.15 72.82 ± 0.65 69.03 ± 0.76 10.04 ± 0.04 66.79 ± 2.24
CrowdLayer Best 68.31 ± 1.65 69.43 ± 1.30 63.79 ± 0.84 73.51 ± 0.15 73.81 ± 0.48 73.91 ± 0.21 70.86 ± 0.65 70.54 ± 0.26 69.76 ± 0.43
Last 61.75 ± 0.95 62.06 ± 0.58 62.95 ± 1.10 71.22 ± 0.66 71.02 ± 0.41 70.01 ± 0.21 61.54 ± 0.34 61.25 ± 0.51 60.59 ± 0.30
AggNet Best 72.87 ± 0.47 51.98 ± 2.01 72.54 ± 0.30 74.27 ± 0.18 10.00 ± 0.00 73.91 ± 0.21 69.98 ± 0.11 10.00 ± 0.00 69.63 ± 0.23
Last 72.12 ± 0.45 50.98 ± 2.06 71.82 ± 0.59 73.94 ± 0.38 10.00 ± 0.00 72.85 ± 0.34 69.05 ± 0.72 10.00 ± 0.00 68.80 ± 0.69
CoNAL Best 72.13 ± 0.40 71.88 ± 0.64 70.47 ± 0.52 74.40 ± 0.29 71.99 ± 3.59 73.95 ± 0.25 70.42 ± 0.23 66.69 ± 1.74 70.15 ± 0.05
Last 65.84 ± 0.38 64.80 ± 0.19 63.76 ± 2.01 71.23 ± 0.32 67.47 ± 5.98 70.76 ± 0.84 60.79 ± 1.12 59.79 ± 3.59 59.98 ± 0.70
Max-MIG Best 71.16 ± 0.55 70.61 ± 0.23 70.99 ± 0.52 73.91 ± 0.13 74.23 ± 0.35 72.72 ± 0.11 70.14 ± 0.45 70.36 ± 0.42 68.94 ± 0.44
Last 63.65 ± 1.17 62.26 ± 0.28 63.47 ± 0.14 73.17 ± 0.74 73.64 ± 0.23 71.79 ± 0.45 68.37 ± 0.19 68.01 ± 0.54 64.17 ± 0.56
KD-Crowd Stage 1 77.25 ± 0.42 77.03 ± 0.92 77.05 ± 0.78 85.12 ± 0.37 85.48 ± 0.32 84.77 ± 0.25 84.57 ± 0.29 84.62 ± 0.43 84.52 ± 0.50
Stage 2 88.99 ± 0.86 88.23 ± 0.97 88.28 ± 0.58 89.10 ± 0.96 89.33 ± 0.84 89.28 ± 0.10 87.92 ± 0.52 87.91 ± 0.52 88.44 ± 0.30
Ensemble 89.29 ± 0.69 88.73 ± 0.76 88.75 ± 0.30 89.96 ± 0.81 90.28 ± 0.15 90.10 ± 0.13 88.86 ± 0.56 88.89 ± 0.21 89.40 ± 0.24
Noise scenarios Sym 60% Sym 80% Asym
Methods Independent Mistakes Effortless Workers Correlated Mistakes Independent Mistakes Effortless Workers Correlated Mistakes Independent Mistakes Effortless Workers Correlated Mistakes
DL-MV Best 70.18 ± 0.40 23.48 ± 1.06 69.32 ± 1.05 32.13 ± 1.36 10.02 ± 0.04 36.62 ± 0.98 53.62 ± 0.15 42.37 ± 0.56 53.00 ± 0.78
Last 41.16 ± 1.30 19.47 ± 0.96 37.49 ± 1.53 18.37 ± 0.72 10.00 ± 0.00 17.33 ± 0.30 52.27 ± 1.01 35.75 ± 0.79 50.32 ± 0.28
WDN Best 71.26 ± 0.40 16.47 ± 1.71 55.72 ± 1.59 46.93 ± 1.58 10.23 ± 0.23 17.00 ± 0.95 68.78 ± 2.27 40.72 ± 3.99 66.09 ± 0.76
Last 45.93 ± 0.66 14.79 ± 4.19 36.05 ± 1.20 26.29 ± 9.14 10.16 ± 0.19 15.50 ± 1.35 52.56 ± 4.33 26.01 ± 4.12 55.27 ± 6.03
CrowdLayer Best 73.56 ± 0.50 71.67 ± 0.27 57.70 ± 2.49 56.50 ± 1.55 41.79 ± 1.74 10.99 ± 0.94 81.48 ± 0.29 71.26 ± 4.29 72.41 ± 0.48
Last 51.81 ± 1.01 50.17 ± 1.02 48.14 ± 0.88 24.28 ± 1.28 13.72 ± 1.54 10.53 ± 0.56 80.91 ± 0.93 70.15 ± 5.11 71.54 ± 0.47
AggNet Best 76.92 ± 0.67 76.43 ± 0.38 71.35 ± 0.64 59.79 ± 1.50 40.57 ± 3.19 44.49 ± 1.18 82.10 ± 0.41 78.35 ± 1.29 69.40 ± 0.36
Last 55.49 ± 1.47 54.41 ± 0.16 44.32 ± 0.52 26.96 ± 1.07 25.46 ± 0.87 15.43 ± 0.99 80.43 ± 0.43 76.93 ± 1.19 65.56 ± 2.64
CoNAL Best 73.80 ± 0.06 16.83 ± 9.56 74.06 ± 0.77 55.46 ± 0.98 10.00 ± 0.00 53.12 ± 0.45 79.96 ± 0.28 57.80 ± 4.37 80.29 ± 0.24
Last 47.82 ± 5.55 16.79 ± 9.49 47.02 ± 3.47 22.93 ± 0.92 10.00 ± 0.00 21.69 ± 0.48 79.18 ± 0.64 56.99 ± 3.97 80.04 ± 0.50
Max-MIG Best 74.07 ± 0.23 74.04 ± 0.45 72.47 ± 0.76 55.07 ± 3.21 56.91 ± 0.93 49.73 ± 0.34 80.99 ± 0.24 80.25 ± 0.20 78.16 ± 0.27
Last 61.00 ± 0.44 62.26 ± 0.28 54.50 ± 0.50 27.56 ± 1.39 29.16 ± 0.97 22.15 ± 0.16 80.05 ± 0.70 79.38 ± 0.29 76.21 ± 1.37
KD-Crowd Stage 1 84.60 ± 0.51 86.66 ± 0.81 83.95 ± 0.44 68.94 ± 0.88 71.54 ± 0.57 47.75 ± 1.28 87.11 ± 1.09 88.34 ± 0.87 86.60 ± 0.88
Stage 2 83.46 ± 0.13 83.43 ± 0.32 82.69 ± 0.96 73.22 ± 0.37 74.69 ± 0.21 51.10 ± 1.60 84.41 ± 0.89 84.92 ± 0.32 84.35 ± 0.58
Ensemble 86.08 ± 0.10 87.17 ± 0.16 85.87 ± 0.15 72.23 ± 0.16 74.35 ± 0.33 51.74 ± 0.85 89.95 ± 0.08 89.60 ± 0.30 87.65 ± 0.25
Tab.2 Test accuracy (%) on real-world datasets LabelMe and CIFAR10H. For the baselines, we report both the best and last performance. For the proposed KD-Crowd, we report three results for better understanding. Bold values represent the best three methods, bold and underlined values represent the best methods
Datasets LabelMe CIFAR10H
Methods Best Last Best Last
DL-MV 81.23 ± 2.78 76.37 ± 0.67 63.87 ± 0.54 58.43 ± 0.80
WDN 83.78 ± 0.86 82.24 ± 0.37 69.43 ± 0.32 68.67 ± 0.45
CrowdLayer 85.63 ± 0.53 80.78 ± 0.39 67.26 ± 0.43 65.30 ± 1.17
AggNet 87.05 ± 0.24 83.87 ± 0.87 69.58 ± 0.30 67.94 ± 0.77
CoNAL 84.01 ± 0.36 79.97 ± 0.93 66.46 ± 0.52 61.53 ± 0.88
Max-MIG 87.40 ± 1.55 82.32 ± 0.59 67.87 ± 0.21 65.83 ± 0.24
KD-Crowd Stage 1 89.22 ± 1.12 70.19 ± 0.51
Stage 2 89.11 ± 0.21 69.99 ± 0.65
Ensemble 89.03 ± 0.52 71.46 ± 0.28
Tab.3 Ablation study results of test and train accuracy (%) on CIFAR-10 with independent mistakes. We only report the last performance results of stages in KD-Crowd. The two rows of Initialization mean Max-MIG pre-trained for 10 epochs (row 1) and 50 epochs (row 2). Bold value represents the best method
Metric Test accuracy Train accuracy
Noise scenarios Sym 60% Sym 80% Asym Sym 60% Sym 80% Asym
Stages Student type Distillation method
Initialization 70.32 ± 0.02 49.48 ± 2.69 79.64 ± 0.41 78.01 ± 0.15 51.47 ± 2.96 90.26 ± 0.14
61.00 ± 0.44 27.56 ± 1.39 80.05 ± 0.70 67.05 ± 0.24 31.49 ± 0.21 90.75 ± 0.07
Stage 1 ELR 72.87 ± 1.04 53.74 ± 0.92 78.23 ± 0.54 79.21 ± 0.30 53.65 ± 1.77 87.76 ± 0.36
ELR+ 84.60 ± 0.51 68.94 ± 0.88 87.11 ± 1.09 86.21 ± 0.34 70.23 ± 0.29 91.26 ± 0.12
Stage 2 ELR → Di KL Divergence 71.14 ± 0.20 50.92 ± 1.21 78.77 ± 0.42 77.28 ± 0.31 50.61 ± 0.71 87.61 ± 0.06
ELR → M KL Divergence 72.74 ± 0.24 51.93 ± 2.30 79.25 ± 0.18 77.90 ± 0.41 52.04 ± 2.20 87.86 ± 0.25
ELR → M MIGf 79.12 ± 0.48 65.67 ± 2.21 82.22 ± 0.43 84.13 ± 0.58 67.40 ± 2.12 89.59 ± 0.26
ELR+ → M MIGf 81.87 ± 0.45 70.09 ± 0.70 82.94 ± 0.60 85.74 ± 0.09 69.98 ± 0.49 90.87 ± 0.03
ELR+ → M MIGf, Disturb 83.46 ± 0.13 73.22 ± 0.37 84.41 ± 0.89 88.56 ± 0.07 72.89 ± 0.09 92.79 ± 0.13
Ensemble ELR+ → M MIGf, Disturb 86.08 ± 0.10 72.23 ± 0.16 89.95 ± 0.08 88.73 ± 0.14 72.06 ± 0.14 93.26 ± 0.03
For the two parameters $\tau$ and $\lambda$ of KD-Crowd, we set $\tau = 1$: three values 0.5, 1, and 2 were tested under the scenario of CIFAR-10 with Sym 60% independent mistakes, resulting in little performance difference, so we fixed it as 1 for all experiments for simplicity. $\lambda$ is set as 1 because we treat the distillation and crowdsourcing losses equally. The magnitudes and layers of RandAugment for each batch are chosen randomly from {9,12,15} and {3,5}. Finally, we repeat each experiment three times, then report the best and last performance results (mean ± std) for the baseline methods, and the last performance for KD-Crowd, including the students of Stages 1 and 2 and their ensembles.

4.2 Comparison for classification accuracy

Tab.1 and Tab.2 respectively show results for the synthesized and real-world datasets. To illustrate the overfitting behavior of learning models, we report both the best and last performance for the baselines. For the proposed two-stage approach KD-Crowd, we report three results for better understanding: Stage 1 means the test performance of the model trained in Stage 1, Stage 2 means the test performance of the model trained in Stage 2, Ensemble means the performance by ensembling the prediction given by Stage 1 and Stage 2. We use bold values to represent the best three results among all methods, with underlined and bold values denoting the best results.
Tab.1 shows results for the 20% instance-dependent noise case, the 20%, 40%, 60%, and 80% symmetric noise cases, and the asymmetric noise case. Built on Max-MIG, the three reported results of KD-Crowd are the best three among all methods in almost all cases. In some cases, Stage 2 is inferior to Stage 1, but still better than the best performance of its base model Max-MIG. This might be because the transition matrix estimation in Stage 2 makes mistakes, which in turn propagate to the true label predictions and cause degeneration. The ensemble of the two stages obtains the best results in most cases. Specifically, it improves over the best performance of Max-MIG by 4%−18% absolute test accuracy in most cases. We claim that the superiority of KD-Crowd comes from two aspects: first, the models in KD-Crowd are trained with much fewer epochs, and such early stopping is critical to avoid overfitting the label noise; second, the best but still inferior performance of the baselines confirms the necessity of the knowledge distillation and refinement framework.
For the baselines, we can see an obvious gap between their best and last performance, indicating that they easily memorize and overfit the annotation noise. Besides, among the baselines, DL-MV gets poor performance due to treating all workers equally; in particular, it is not robust in cases with effortless workers. WDN models the worker expertise using a single real-valued number and achieves limited improvement. CrowdLayer and AggNet use transition matrices, which are more informative with improved model capacity, and improve over WDN. CoNAL extends CrowdLayer by modeling workers’ shared mistakes and obtains improvements on correlated mistakes. However, it is not robust to effortless workers, because they should be viewed as independent and it is harmful to model the shared mistakes between junior and senior workers. Max-MIG trains a DNN classifier and an annotation aggregator by maximizing their f-mutual information gain without explicitly assuming the workers’ relationship, and is more robust to effortless workers and correlated mistakes. The cases where the student model in Stage 2 degenerates on true label prediction might be because it assumes an oversimplified class-dependent transition matrix noise model, whose estimation errors occur and propagate to the true label predictions.
For the two real-world datasets in Tab.2, we observe similar results as in the synthesized data case. The transition-matrix-based worker expertise modeling crowdsourcing methods overfit the annotation noise and achieve inferior performance. KD-Crowd obviously improves Max-MIG and outperforms the other methods, which validates its effectiveness in real-world situations. Please note that the results of the baselines differ from their originally reported ones. This is because KD-Crowd uses the raw image features rather than the extracted ones given by CrowdLayer, such that image augmentation can be reasonably conducted. For a fair comparison, the same raw images, backbone network and optimizer are used for the baselines, which improves all baselines’ best performance except CoNAL (84.01 vs. its originally reported 87.12 on LabelMe). But even the better original 87.12 test accuracy is not comparable to KD-Crowd.

4.3 Comparison for estimating transition matrix

In this subsection, we compare the worker expertise modeling of Max-MIG and the proposed KD-Crowd. As Max-MIG does not directly output the noise transition matrix but the inverse noise transition matrix, we calculate the estimated transition matrix $\hat{T}^m$ using the DNN classifier’s predictions and the annotations of worker $m$. For the two real-world datasets, the ground-truth transition matrix $T^m$ is calculated using the true labels and the annotations of the workers.
For each worker, we calculate the entry-wise estimation error matrix as $T_e^m = |T^m - \hat{T}^m| / T^m$. Fig.3 shows the worker-wise error matrices of Max-MIG and KD-Crowd on CIFAR-10 with 60% symmetric independent mistakes. Fig.4 gives the comparison for CIFAR10H and LabelMe: it plots the gap between the overall error $E^m$ of Max-MIG and KD-Crowd on each worker, where $E^m$ is calculated as the mean value of all entries of the estimation error matrix $T_e^m$. It can be seen that KD-Crowd achieves much lower transition matrix estimation error on most workers compared to Max-MIG.
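The computation described above can be sketched as follows (our own reconstruction with assumed helper names, not the authors' evaluation code):

```python
import numpy as np

def estimate_transition(labels, worker_annos, C, missing=0):
    """Estimate worker m's transition matrix by counting (label, annotation) pairs;
    labels are indexed 0..C-1 and annotations 1..C, with 0 meaning missing."""
    T_hat = np.zeros((C, C))
    for i, j in zip(labels, worker_annos):
        if j != missing:
            T_hat[i, j - 1] += 1
    row_sums = np.maximum(T_hat.sum(axis=1, keepdims=True), 1)
    return T_hat / row_sums                  # row i: P(annotation | label i)

def entrywise_error(T_true, T_hat, eps=1e-8):
    """Entry-wise relative error |T^m - T^m_hat| / T^m."""
    return np.abs(T_true - T_hat) / (T_true + eps)
```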
Fig.3 Illustrative results of the entry-wise transition matrix estimation error $T_e^m = |T^m - \hat{T}^m| / T^m$ on CIFAR-10 with 60% symmetric independent mistakes, with $T^m$ and $\hat{T}^m$ respectively denoting the ground-truth and estimated transition matrices. Darker (lighter) entries mean larger (smaller) errors


Fig.4 The gap between the overall errors $E^m$ of Max-MIG and KD-Crowd for all workers on CIFAR10H (a) and LabelMe (b). $E^m$ is calculated as the mean value of the entries in $T_e^m = |T^m - \hat{T}^m| / T^m$


4.4 Ablation study

To examine the significance of each component, we design baselines from several aspects as follows:
Stages: the pretraining, Stage 1, and Stage 2 of KD-Crowd, plus the ensemble used at the testing phase.
Student type: which student model is used in each stage. In Stage 1, we test ELR+ and its degenerated implementation ELR, which uses one network classifier without semi-supervised learning techniques. In Stage 2, we test distillation only (Di) and Max-MIG (M). “Teacher → Student”, e.g., “ELR → M”, means we select ELR as the teacher and Max-MIG as the student in Stage 2.
Distillation method: which distillation loss and input instance strategy is used. We test KL divergence, MIGf, and the effect of extra disturbed input instances (Disturb) for Eq. (7).
Here we only report the test and train accuracy of baselines over CIFAR-10 with independent mistakes in Tab.3. The results for other noise scenarios are similar and omitted.
At pretraining, we compare Max-MIG trained for 10 epochs (row 1) with 50 epochs (row 2) and find that performance decreases in most cases, indicating that the DNN classifier overfits the annotation noise due to inaccurate worker expertise modeling. Thus for the other experiments, we take the model trained for 10 epochs as the teacher in Stage 1.
In Stage 1, we train a noise-model-free robust student model to refine the teacher’s true label predictions. With ELR, the DNN classifier gets improved in most cases. And ELR+, which is stronger than ELR, can correct more mistakes and better refine the true label predictions.
In Stage 2, we jointly train the crowdsourcing model using the refined true label predictions and the crowds’ annotations. Because the noise-model-free robust model in Stage 1 may even introduce new mistakes into the refined true label predictions, only mimicking it with the KL divergence loss would make the student model overfit the additional noise. Thus it is necessary to retrain on the crowds’ annotations with a robust knowledge distillation loss, which is confirmed by our empirical results.

4.5 Generality of the proposed framework

In the above, we give one typical implementation, using ELR+ to refine Max-MIG. Actually, any effective noise-model-free method can be used to refine the label predictions of any crowdsourcing method under the proposed KD-Crowd framework. We further validate the generality of KD-Crowd with another representative crowdsourcing model, CrowdLayer [4], and another noise-model-free method, DivideMix [15].
CrowdLayer appends a layer of coefficients for each worker after the output layer of the DNN classifier to characterize the annotator expertise, with the final output layer corresponding to the crowdsourcing annotation prediction. It optimizes all parameters end-to-end by minimizing the annotation prediction loss. DivideMix uses two networks to perform noisy sample detection for each other based on a two-component Gaussian mixture model over sample losses, then treats the noisy samples as unlabeled data and applies a superior semi-supervised learning technique.
Tab.4 shows the results for four implementations of KD-Crowd on the two real-world datasets LabelMe and CIFAR10H. We report both the test accuracy and train accuracy results. The same setting is adopted as the previous Max-MIG/ELR+ implementation, i.e., the epochs of the pretraining and the Stage 1, 2 in KD-Crowd are 5, 60, 10 on LabelMe, and 20, 60, 20 on CIFAR10H. It can be seen that CrowdLayer and Max-MIG are significantly improved by using our knowledge distillation refinement framework.
Tab.4 Different implementation of KD-Crowd on real-world datasets LabelMe and CIFAR10H. Bold values represent the best methods
Metric Test accuracy Train accuracy
Datasets LabelMe CIFAR10H LabelMe CIFAR10H
Methods Stage
CrowdLayer ELR+ Stage 1 79.53 ± 3.58 69.77 ± 0.47 86.28 ± 4.72 80.65 ± 1.21
Stage 2 82.21 ± 0.65 69.67 ± 1.63 83.27 ± 0.57 77.65 ± 0.82
Ensemble 80.78 ± 3.00 72.10 ± 0.80
CrowdLayer DivideMix Stage 1 83.42 ± 0.67 68.07 ± 2.23 85.80 ± 2.10 77.83 ± 3.34
Stage 2 82.18 ± 0.95 64.40 ± 6.90 76.43 ± 0.87 72.60 ± 8.14
Ensemble 86.56 ± 0.98 71.73 ± 1.37
Max-MIG ELR+ Stage 1 89.33 ± 0.24 69.43 ± 0.32 86.69 ± 4.72 79.60 ± 1.29
Stage 2 87.77 ± 0.31 68.67 ± 0.45 91.63 ± 1.07 79.67 ± 3.68
Ensemble 89.79 ± 0.10 71.46 ± 0.28
Max-MIG DivideMix Stage 1 84.37 ± 1.99 66.73 ± 0.97 87.90 ± 1.15 73.13 ± 0.46
Stage 2 88.97 ± 0.42 71.27 ± 1.33 91.17 ± 0.63 77.39 ± 0.21
Ensemble 88.24 ± 1.66 70.43 ± 1.27

5 Conclusion

In this paper, we propose a general knowledge distillation framework named KD-Crowd for crowdsourcing learning. Considering that current worker expertise modeling crowdsourcing methods commonly overfit annotation noise either due to the oversimplified noise pattern assumption, or inaccurate estimation, we propose to improve them by combining the complementary strength of noise-model-free robust learning techniques, which makes no assumption over the underlying noise pattern, but handles the label noise by exploring the dynamic process of optimization policies. We apply KD-Crowd to two different crowdsourcing models and achieve significant improvements over extensive experiments. In the future, we plan to test the effectiveness of this framework on regular single-source noisy label learning scenarios with complex instance-dependent noise.

Shaoyuan Li is an associate professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics, China. She received BSc and PhD degrees in computer science from Nanjing University, China in 2010 and 2018, respectively. Her research interests include machine learning and data mining. She has won the Champion of PAKDD’12 Data Mining Challenge, the Best Paper Award of PRICAI’18, 2nd place in Learning and Mining with Noisy Labels Challenge at IJCAI’22, the 4th place in the Continual Learning Challenge at CVPR’23

Yuxiang Zheng received the BSc degree in computer science from Zhejiang University of Technology, China in 2022. Currently, he is working toward an MS degree in the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interests include continual learning and crowdsourcing

Ye Shi received the BSc degree in computer science from the China University of Mining and Technology, China in 2019, and the MS degree from Nanjing University of Aeronautics and Astronautics, China in 2022. His research focuses on crowdsourcing and multi-label classification

Shengjun Huang received the BSc and PhD degrees in computer science from Nanjing University, China in 2008 and 2014, respectively. He is now a professor in the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics, China. His main research interests include machine learning and data mining. He has been selected to the Young Elite Scientists Sponsorship Program by CAST in 2016, and won the China Computer Federation Outstanding Doctoral Dissertation Award in 2015, the KDD Best Poster Award in 2012, and the Microsoft Fellowship Award in 2011. He is a Junior Associate Editor of Frontiers of Computer Science

Songcan Chen received a BS degree in mathematics from Hangzhou University (now merged into Zhejiang University), China in 1983, and a MS degree in computer applications from Shanghai Jiaotong University, China in 1985, and then worked with NUAA in January 1986. He received the PhD degree in communication and information systems from the Nanjing University of Aeronautics and Astronautics, China in 1997. Since 1998, as a full-time professor, he has been with the College of Computer Science and Technology, NUAA, China. His research interests include pattern recognition, machine learning, and neural computing. He is also an IAPR fellow

References

[1]
Snow R, O’Connor B, Jurafsky D, Ng A Y. Cheap and fast - but is it good?: evaluating non-expert annotations for natural language tasks. In: Proceedings of Conference on Empirical Methods in Natural Language Processing. 2008, 254−263
[2]
Raykar V C, Yu S, Zhao L H, Valadez G H, Florin C, Bogoni L, Moy L . Learning from crowds. The Journal of Machine Learning Research, 2010, 11: 1297–1322
[3]
Albarqouni S, Baur C, Achilles F, Belagiannis V, Demirci S, Navab N . AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, 2016, 35( 5): 1313–1321
[4]
Rodrigues F, Pereira F. Deep learning from crowds. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 30th Innovative Applications of Artificial Intelligence Conference, and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2017, 197
[5]
Yang Y, Wei H, Zhu H, Yu D, Xiong H, Yang J. Exploiting crossmodal prediction and relation consistency for semisupervised image captioning. IEEE Transactions on Cybernetics, 2022, doi: 10.1109/TCYB.2022.3156367
[6]
Dawid A P, Skene A M . Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 1979, 28( 1): 20–28
[7]
Cao P, Xu Y, Kong Y, Wang Y. Max-MIG: an information theoretic approach for joint learning from crowds. In: Proceedings of the 7th International Conference on Learning Representations. 2019
[8]
Natarajan N, Dhillon I S, Ravikumar P, Tewari A. Learning with noisy labels. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 2013, 1196−1204
[9]
Arpit D, Jastrzebski S, Ballas N, Krueger D, Bengio E, Kanwal M S, Maharaj T, Fischer A, Courville A, Bengio Y, Lacoste-Julien S. A closer look at memorization in deep networks. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 233−242
[10]
Gui X-J, Wang W, Tian Z-H. Towards understanding deep learning from noisy labels with small-loss criterion. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence. 2021, 2469−2475
[11]
Jiang L, Zhou Z, Leung T, Li L-J, Li F-F. MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 2304−2313
[12]
Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I W, Sugiyama M. Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8536−8546
[13]
Yu X, Han B, Yao J, Niu G, Tsang I, Sugiyama M. How does disagreement help generalization against label corruption? In: Proceedings of the 36th International Conference on Machine Learning. 2019, 7164−7173
[14]
Malach E, Shalev-Shwartz S. Decoupling “when to update” from “how to update”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 961−971
[15]
Li J, Socher R, Hoi S C H. DivideMix: learning with noisy labels as semi-supervised learning. In: Proceedings of the 8th International Conference on Learning Representations. 2020
[16]
Liu S, Niles-Weed J, Razavian N, Fernandez-Granda C. Early-learning regularization prevents memorization of noisy labels. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1707
[17]
Song H, Kim M, Lee J G. SELFIE: refurbishing unclean samples for robust deep learning. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5907−5915
[18]
Liu Q, Peng J, Ihler A T. Variational inference for crowdsourcing. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. 2012, 692−700
[19]
Zhou D, Platt J C, Basu S, Mao Y. Learning from the wisdom of crowds by minimax entropy. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. 2012, 2195−2203
[20]
Rodrigues F, Pereira F C, Ribeiro B. Gaussian process classification and active learning with multiple annotators. In: Proceedings of the 31st International Conference on Machine Learning. 2014, 433−441
[21]
Guan M Y, Gulshan V, Dai A M, Hinton G E. Who said what: modeling individual labelers improves classification. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. 2018
[22]
Tanno R, Saeedi A, Sankaranarayanan S, Alexander D C, Silberman N. Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 11236−11245
[23]
Chu Z, Ma J, Wang H. Learning from crowds by modeling common confusions. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence AAAI 2021, 33rd Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, and 11th Symposium on Educational Advances in Artificial Intelligence. 2021, 5832−5840
[24]
Li S-Y, Huang S-J, Chen S . Crowdsourcing aggregation with deep Bayesian learning. Science China Information Sciences, 2021, 64( 3): 130104
[25]
Shi Y, Li S-Y, Huang S-J . Learning from crowds with sparse and imbalanced annotations. Machine Learning, 2023, 112( 6): 1823–1845
[26]
Li S-Y, Jiang Y, Chawla N V, Zhou Z-H . Multi-label learning from crowds. IEEE Transactions on Knowledge and Data Engineering, 2019, 31( 7): 1369–1382
[27]
Lee K, Yun S, Lee K, Lee H, Li B, Shin J. Robust inference via generative classifiers for handling noisy labels. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 3763−3772
[28]
Yao Y, Liu T, Han B, Gong M, Deng J, Niu G, Sugiyama M. Dual T: reducing estimation error for transition matrix in label-noise learning. In: Proceedings of the 34th Conference on Neural Information Processing Systems. 2020, 7260−7271
[29]
Ghosh A, Kumar H, Sastry P S. Robust loss functions under label noise for deep neural networks. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 1919−1925
[30]
Zhang Z, Sabuncu M R. Generalized cross entropy loss for training deep neural networks with noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8792−8802
[31]
Ma X, Huang H, Wang Y, Erfani S R S, Bailey J. Normalized loss functions for deep learning with noisy labels. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 607
[32]
Li M, Soltanolkotabi M, Oymak S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. 2020, 4313−4324
[33]
Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015. arXiv preprint arXiv: 1503.02531
[34]
Zhou Z-H, Jiang Y, Chen S-F . Extracting symbolic rules from trained neural network ensembles. AI Communications, 2003, 16( 1): 3–15
[35]
Zhou Z-H, Jiang Y . NeC4.5: neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 2004, 16( 6): 770–773
[36]
Li N, Yu Y, Zhou Z-H. Diversity regularized ensemble pruning. In: Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2012, 330−345
[37]
Li Y, Yang J, Song Y, Cao L, Luo J, Li L-J. Learning from noisy labels with distillation. In: Proceedings of IEEE International Conference on Computer Vision. 2017, 1928−1936
[38]
Zhang Z, Zhang H, Arik S Ö, Lee H, Pfister T. Distilling effective supervision from severe label noise. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 9291−9300
[39]
Yang Y, Zhan D-C, Fan Y, Jiang Y, Zhou Z-H. Deep learning for fixed model reuse. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 2831−2837
[40]
Xie Q, Luong M T, Hovy E, Le Q V. Self-training with noisy student improves ImageNet classification. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 10684−10695
[41]
Cubuk E D, Zoph B, Shlens J, Le Q V. Randaugment: practical automated data augmentation with a reduced search space. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, 3008−3017
[42]
Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto, Dissertation, 2009
[43]
Xia X, Liu T, Han B, Wang N, Gong M, Liu H, Niu G, Tao D, Sugiyama M. Part-dependent label noise: towards instance-dependent label noise. In: Proceedings of the 34th Conference on Neural Information Processing Systems. 2020, 7597−7610
[44]
Peterson J C, Battleday R M, Griffiths T L, Russakovsky O. Human uncertainty makes classification more robust. In: Proceedings of IEEE/CVF International Conference on Computer Vision. 2019, 9617−9626
[45]
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770−778
[46]
Ma N, Zhang X, Zheng H-T, Sun J. ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 122−138

Acknowledgements

This work was supported by the National Key R&D Program of China (2022ZD0114801), the National Natural Science Foundation of China (Grant No. 61906089), and the Jiangsu Province Basic Research Program (BK20190408).

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press

Supplementary files

FCS-23578-OF-SL_suppl_1 (260 KB)
