1 Introduction
Owing to their effectiveness, graph neural networks (GNNs) have become the first choice for graph-based problems. In practice, however, graph data is often polluted in ways that are unknown in advance, such as deliberate adversarial attacks or accidental annotation mistakes [1]. For example, data is frequently annotated on crowd-sourcing platforms, where it is inevitable that annotators make mistakes and introduce label noise [2]. In this case, GNNs tend to over-fit the noisy graph data and suffer severe performance degradation [3,4]. In this paper, we focus on mitigating the negative influence of potential label noise in graph data.
To train robust graph neural networks in the presence of label noise, researchers have developed strategies such as sample re-weighting and label correction. Meanwhile, given the interdependence of nodes in graph data, it is important to consider the effects imposed by the underlying topology. To this end, recent works such as [5] and [6] incorporate neighborhood information into the label correction and sample re-weighting processes. Other approaches aim to protect the representations of nodes from the same class against label noise: [7] constructs positive node pairs, while [8] adjusts the graph topology by adding edges between labeled and unlabeled nodes. However, these methods may introduce additional noise by revising the already noisy graph data, which can cause a performance drop and even more incorrect predictions.
A feasible alternative is to select the right data for model training instead of revising the noisy graph data and introducing extra noise. The aforementioned methods still train GNNs with all the observed labels and try not to miss the correct part of the dataset. Let us think about it another way: to learn a better model, is it equally effective to use a subset of labeled nodes that covers as little noise as possible? We therefore estimate the impact of incorrect labels when training GCN [9] for the node classification task. As can be seen from Fig.1(a), the prediction accuracy on the test set keeps dropping as the noise rate increases, whereas training GCN on the subset of the training set obtained by masking all incorrect labels leads to only slight degradation.
Fig.1 (a) The performance of GCN under different levels of noise rate on Cora and Citeseer. The “-NF” dataset is derived from the noisy dataset by eliminating the noise labels; (b) the loss curve of GCN trained on Cora under noise rate=0.3
To this end, we look deeper into the training of GNNs on noisy graph data and find that mislabeled nodes and correctly labeled nodes behave differently during training. As observed in Fig.1(b), the average loss on mislabeled nodes is larger than that on correctly labeled nodes. Mislabeled nodes thus show a prediction deviation from correctly labeled nodes, and this deviation tends to grow as training proceeds. A similar phenomenon has been observed in the image field [10]. Unlike image data, graph data contains connections between nodes, and graph neural networks update node representations based on the graph structure. We therefore revisit the training differences between mislabeled and correctly labeled nodes from the perspective of node neighborhoods, analyzing both local and global structural views. 1) Local view: we measure the local relative relationship by estimating the distribution distance between the representations of a node and its surrounding neighbors. 2) Global view: we estimate the global relative relationship by measuring the change in a node's relative position ranking with respect to similar nodes in the original feature space. Fig.2(a) and Fig.2(b) show the changes in the node-level KL-divergence and NDCG during training, respectively. There are obvious deviations between mislabeled and correctly labeled nodes from both views, which we call the local deviation and the global deviation, respectively.
Fig.2 (a) The KL-divergence of the prediction distributions between nodes and local surrounding nodes; (b) the NDCG@10 of ranking similar nodes based on feature space in the output space. We get the results on Cora with GCN when the noise rate is 0.3
Based on the above analysis, we identify three types of deviations between mislabeled and correctly labeled nodes. These training statistics uncover the potential impact of label noise from the prediction view as well as the local and global neighborhood views, and can be further utilized for data selection.
Thus, we propose a novel framework that reduces the influence of label noise through self-adaptive data utilization. During training, we introduce a self-adaptive sample network that selects a subset of labeled nodes based on the current training state. Specifically, it learns a sample weight from the prediction, local, and global deviations and prunes labeled nodes according to the output weights. The selected nodes are then re-weighted with the learned weights, which serve as a loss correction for model training. Remarkably, the sample network realizes dynamic data selection and avoids the accumulation of selection errors. Moreover, it supports flexible training mechanisms; for example, selecting more low-weight nodes as training proceeds is effective for avoiding over-fitting.
In summary, the contributions are threefold:
● We analyze the training states of GNNs trained on noisy data and find noticeable deviations in the training states of mislabeled nodes.
● We propose a novel framework, denoted Soft-GNN, which introduces a self-adaptive sample network for dynamic data selection to mitigate the negative impact of label noise.
● We conduct extensive experiments to verify the effectiveness of the proposed framework and give new insights into the improvements.
The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 introduces the preliminary and addresses the challenge. Next, we provide the solution and illustrate the proposed framework in Section 4. Then, we present the details of the experiments and analyze the experimental results in Section 5. In the end, we conclude and discuss the future work in Section 6.
2 Related works
2.1 Label noise
Label noise is a common problem in classification tasks and can have serious consequences, such as performance degradation or model corruption [11]. Incorrect labels arise from inaccurate manual or machine annotation, for example in medical data [12,13], image data [14,15], and language corpora [16]. How to learn with noisy labels has therefore attracted increasing research interest. Some studies focus on noise-robust models, such as SVMs [17] and decision trees [18], while others concentrate on learning algorithms for noisy labels [19]. The typical learning paradigm is to distinguish the noisy labels and reduce their effect during training. Learning robust models under label noise has been widely studied in the image field, where the methods fall into two categories: data-based and model-based. The former filters noisy data from the perspective of data generation, for example with Bayesian approaches [20] or cluster-based methods [21]; these methods first identify the possibly noisy data and then learn the model on the filtered clean dataset [22].
Model-based approaches recognize potentially incorrect labels according to the model predictions [23-25] and can be applied to train deep neural networks under label noise in an end-to-end manner. There are three main categories of training strategies: sample re-weighting [26-30], label correction [10,31,32], and cooperative training [33-35]. Sample re-weighting treats noisy and correct samples differently and reassigns sample weights: [26] adopts a Bayesian approach for re-weighting, [27,28] introduce a neural network to learn the weight of each sample, and [29,30] abandon potentially noisy samples during training. Label correction revises the given labels: [10] updates the label with the prediction as a soft label and reassigns the sample weight at the same time, while [31] introduces a meta-model as the label correction network. Cooperative training trains two models that supervise each other: Decoupling [33] updates the models only when they make different predictions, and Co-teaching [34] updates one model with the small-loss samples selected by the other model.
2.2 GNNs with label noise
Recently, graph neural networks such as GCN [9] and APPNP [36] have achieved promising results on graph-based tasks. Generally, the given labels are treated as true labels when training GNNs, and little attention has been paid to training GNNs with label noise. For graph-level classification, some strategies from the image field can be used directly; for example, [3] adopts loss correction by estimating the correction matrix. For node-level classification, however, the nodes in graph data are connected and interdependent. Meanwhile, GNNs perform feature propagation and aggregation, so the influence of label noise can spread to other samples through the edges. For this reason, researchers have designed dedicated strategies for training GNNs under label noise [5-8,37,38]. UnionNET [5] constructs a support node set for each training node and proposes a unified framework with sample re-weighting and label correction. [6] corrects labels by pre-training the model and utilizes the revised labels to train the GNN. Pi-gnn [7] constructs node pairs with positive pairwise interaction among all nodes and constrains the embeddings of node pairs to resist label noise. NRGNN [8] introduces an edge predictor and adds edges between labeled and unlabeled nodes to reduce the effect of label noise. CLNode [37] explores curriculum learning for node classification, improving performance by sequentially training models on increasingly challenging subsets of the dataset. MTS-GNN [38] introduces a multi-teacher self-training strategy for semi-supervised node classification with noisy labels, enhancing performance by exploiting knowledge from multiple teachers.
In this paper, we focus on the node-level classification under label noise. We discover some deviations between the mislabeled nodes and clean nodes, including prediction deviation, local deviation, and global deviation. Based on these observations, we introduce a sample network by using the above-mentioned training statistics and realize the self-adaptive data utilization during training.
Compared with the state-of-the-art baselines, the proposed method neither corrects labels nor adjusts the graph topology. Label correction has a margin of error, and the error may accumulate as training continues; likewise, adjusting the graph topology is time-consuming and its effect is uncertain. By contrast, the proposed framework is flexible in sample selection and can tailor an appropriate learning strategy to the current training state. It promptly avoids error accumulation and achieves a steady performance improvement with little extra time overhead.
3 Preliminary and challenge
3.1 Preliminary
Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the edge set, and $\mathbf{A}$ represents the adjacency matrix. For a node pair $(v_i, v_j)$, $\mathbf{A}_{ij}=1$ if $(v_i, v_j)\in\mathcal{E}$, and 0 otherwise. Graph neural networks are designed to solve graph-based analysis tasks and achieve promising results. For the node classification task, the graph is usually partially labeled and the labels might be noisy. We denote the GNN model as $f_{\theta}$ with parameters $\theta$, and the output of the GNN is derived as
$$\mathbf{Z} = f_{\theta}(\mathbf{A}, \mathbf{X}), \tag{1}$$
where $\mathbf{X}$ is the feature matrix, $\mathbf{Z}$ is the output of the final GNN layer, and the prediction is $\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{Z})$. Since only part of the nodes are labeled, the GNN model is trained in a semi-supervised way. Thus, the objective is
$$\mathcal{L} = \sum_{v_i \in \mathcal{V}_L} \ell\big(\hat{\mathbf{y}}_i, \tilde{y}_i\big), \tag{2}$$
where $\mathcal{V}_L$ is the set of training nodes and $\tilde{y}_i$ is the observed label.
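As a concrete illustration of the formulation above, the following sketch implements a dense two-layer GCN producing $\mathbf{Z}=f_{\theta}(\mathbf{A},\mathbf{X})$ and the semi-supervised cross-entropy objective over the labeled nodes. It is a simplified re-implementation for illustration, not the authors' code; the dense adjacency and the variable names are assumptions.

```python
# Illustrative dense two-layer GCN and semi-supervised objective (Eqs. (1)-(2)).
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

class GCN(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, num_classes: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, num_classes, bias=False)

    def forward(self, A_norm: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H = F.relu(A_norm @ self.w1(X))   # first propagation + transformation
        return A_norm @ self.w2(H)        # logits Z of the final layer

def semi_supervised_loss(Z, y_observed, train_mask):
    # Only the (possibly noisy) labeled nodes contribute to the objective.
    return F.cross_entropy(Z[train_mask], y_observed[train_mask])
```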
3.2 Challenge
In this paper, our goal is to eliminate the negative influence of noisy labels during the training of GNNs. We consider the situation in which the observed labels may be mislabeled, i.e., $\tilde{y}_i \neq y_i$ is possible, where $y_i$ is the ground truth. To better illustrate the challenge while still capturing the essence of graph convolutions, we follow [4] and perform a linearization of the GNN. Taking a two-layer GNN as an example, we replace the non-linearity function with a simple linear function, that is,
$$\mathbf{Z} = \hat{\mathbf{A}}\,\hat{\mathbf{A}}\,\mathbf{X}\,\mathbf{W}_1\mathbf{W}_2, \tag{3}$$
where $\hat{\mathbf{A}}$ is the normalized adjacency matrix with self-loops. Since $\mathbf{W}_1$ and $\mathbf{W}_2$ are free learnable weights, they can be absorbed into a single weight $\mathbf{W}$. Typically, $\mathbf{W}$ is learned through back-propagation using the observed labels in a semi-supervised manner. However, because the observed labels are noisy and can be viewed as perturbations of the real labels, the distributions of the observed labels $\tilde{\mathbf{Y}}$ and the real labels $\mathbf{Y}$ may be different, i.e.,
$$P(\tilde{\mathbf{Y}}) \neq P(\mathbf{Y}) \;\Rightarrow\; \tilde{\mathbf{W}} \neq \mathbf{W}, \tag{4}$$
where $\tilde{\mathbf{W}}$ denotes the parameters learned from the observed labels.
Equation (4) tells us that a shift of the label distribution causes a synchronous shift of the model parameters; we call this phenomenon the distribution shift. Given this, we aim to use variable balancing [39] to jointly learn a global sample weight matrix $\mathbf{S}$ and optimize the model parameters $\mathbf{W}$. Following [39], we consider the stability of the predictions made by the network when it is trained on noisy graph data. However, instead of directly working with the stability of predictions, we focus on the stability of the graph representation learned by the network, which is where the sample weight matrix $\mathbf{S}$ comes into play. Thus, we obtain the formulation in Eq. (5), which jointly optimizes $\mathbf{W}$ and $\mathbf{S}$.
According to Eq. (5), $\mathbf{S}$ is a positive definite matrix. If the sample weights re-balance the observed labels so that their distribution matches that of the real labels, then $\tilde{\mathbf{W}} = \mathbf{W}$. In other words, the sample weights guarantee that the model is robust to noisy labels because they allow the GNN to prevent the distribution shift. Note that the label noise is agnostic, so in practice such an ideal $\mathbf{S}$ is not accessible. Moreover, the noise distribution depends on various factors, and directly learning it is unrealistic. Hence, the biggest challenge in learning a robust GNN is to estimate the sample weight matrix $\mathbf{S}$.
A better model is more likely to be learned when the label distribution is closer to the true distribution. According to Fig.1(a), directly masking the incorrect labels during training already yields better performance; in that case, there is no need to learn a sample weight matrix $\mathbf{S}$ that makes the weighted label distribution equal to the true one. Studies [40,41] suggest that there is little difference between the performance of a model learned on a subset of important samples and on the full dataset. [41] proposes a score function to identify important samples and finds that a pruned dataset can also reduce the effect of label noise. This provides the idea of selective data utilization to mitigate the data distribution shift caused by label noise while guaranteeing the performance of the GNN model.
4 Method
4.1 Overview
We propose a novel framework, named Soft-GNN, which introduces a sample network to select data and mitigate the influence of label noise through self-adaptive data utilization. To this end, we analyze the training states of mislabeled nodes and discover that they deviate from the other labeled nodes at three levels: prediction, local, and global deviation. Based on these deviations, we estimate sample weights from the training deviation features for data selection and re-weight the selected samples for model learning. The architecture of Soft-GNN is illustrated in Fig.3.
Fig.3 The overall framework of Soft-GNN
4.2 The deviation by label noise
4.2.1 Prediction deviation
Previous studies have found that mislabeled samples are difficult for deep neural networks to learn [10]. Fig.1(b) shows a similar phenomenon for GNNs: the loss on clean nodes falls more quickly than that on mislabeled nodes, and the gap becomes more pronounced as training proceeds. Thus, mislabeled nodes are more likely to be hard samples when training GNNs.
Definition 4.1 For each labeled node $v_i$, the prediction deviation, represented by $d^p_i$, is the cross-entropy between the model prediction and the observed label. It is calculated as
$$d^p_i = -\sum_{c=1}^{C} \tilde{y}_{ic}\,\log \hat{y}_{ic},$$
where $C$ is the number of classes, $\hat{y}_{ic}$ is the predicted probability of $v_i$ belonging to class $c$, and $\tilde{y}_{ic}$ is the one-hot indicator of the observed label.
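A hedged sketch of Definition 4.1: given the GCN logits and the observed labels, the per-node prediction deviation is just an unreduced cross-entropy; the variable names are assumptions.

```python
# Sketch of the prediction deviation: per-node cross-entropy between the model
# prediction (logits Z) and the observed, possibly noisy, label.
import torch
import torch.nn.functional as F

def prediction_deviation(Z: torch.Tensor, y_observed: torch.Tensor) -> torch.Tensor:
    """Per-node cross-entropy, shape [num_nodes]."""
    return F.cross_entropy(Z, y_observed, reduction="none")
```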
For graph data, the nodes are connected and interdependent. We further explore whether mislabeled nodes disrupt the inherent relationships with their neighboring nodes and cause deviations in the output space. Theoretically, GNNs follow the paradigm of feature propagation and aggregation, which leads to feature smoothing, meaning that a node can be reconstructed from its neighboring nodes. It has also been verified that feature smoothing theoretically guarantees label smoothing from the view of label propagation [42]. In other words, the reconstruction relations are consistent between the feature space and the output space generated by the learned model. Label noise, however, changes the relations with neighboring nodes in the output space. We find that mislabeled nodes exhibit both a local deviation and a global deviation.
4.2.2 Local deviation
From the topological perspective, Fig.2(a) shows that mislabeled nodes are far away from their surroundings in the output space.
To estimate the prediction divergence between a node and its surrounding nodes, we first generate the neighborhood embedding. If we considered all surrounding nodes, the divergence might be caused not by label noise but by the class variance among the surrounding nodes. Therefore, we select the nodes with high structural dependency as the structural neighborhood to reduce the influence of class variance: the higher the structural dependency between two nodes, the more surrounding nodes they have in common. We utilize mutual information to estimate the structural dependency [43]. Let $\{X_i\}$ denote a set of random variables for the nodes in the graph, where $X_i$ represents randomly selecting one node from the graph for node $v_i$. Let $N(v_i)$ denote the neighborhood of node $v_i$ and $\overline{N}(v_i)$ its complementary set. Then $X_i$ has two possible values $\{0,1\}$, where 0 indicates that the selected node is from $\overline{N}(v_i)$ and 1 indicates that it belongs to $N(v_i)$. Thus, $P(X_i{=}1)=|N(v_i)|/|\mathcal{V}|$ and $P(X_i{=}0)=1-P(X_i{=}1)$.
Hence, the structural dependency between $v_i$ and $v_j$ is estimated by the mutual information between $X_i$ and $X_j$, computed by
$$I(X_i; X_j) = \sum_{x_i \in \{0,1\}} \sum_{x_j \in \{0,1\}} P(x_i, x_j)\,\log \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)},$$
where $P(x_i, x_j)$ is the joint distribution. Considering that both $X_i$ and $X_j$ have two possible values, $(x_i, x_j)$ has four combinations, whose probabilities are obtained by counting the nodes falling into $N(v_i)\cap N(v_j)$, $N(v_i)\cap \overline{N}(v_j)$, $\overline{N}(v_i)\cap N(v_j)$, and $\overline{N}(v_i)\cap \overline{N}(v_j)$, respectively, normalized by $|\mathcal{V}|$.
For a training node $v_i$, we compute the structural dependency between $v_i$ and each neighboring node $v_j$. The top-$k$ nodes with the highest structural dependency are selected as the structural neighborhood, denoted by $N_s(v_i)$. Then, we generate the neighborhood embedding by aggregating the representations of the nodes in $N_s(v_i)$; for example, with a mean aggregator, the class distribution of the neighborhood is derived as
$$\hat{\mathbf{y}}_{N_s(v_i)} = \mathrm{softmax}\Big(\frac{1}{k}\sum_{v_j \in N_s(v_i)} \mathbf{z}_j\Big),$$
where $\mathbf{z}_j$ is the representation of $v_j$ in the final GNN layer.
Definition 4.2 For each labeled node $v_i$, the local deviation, represented by $d^l_i$, is the node-to-neighborhood divergence, estimated by the KL-divergence between the class distributions of the node and its structural neighborhood:
$$d^l_i = \mathrm{KL}\big(\hat{\mathbf{y}}_i \,\|\, \hat{\mathbf{y}}_{N_s(v_i)}\big).$$
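The sketch below walks through the local-deviation pipeline under stated assumptions: `neighbors` maps each node to its neighbor set, `probs` holds the softmax class distributions of all nodes, and the joint probabilities of $(X_i, X_j)$ are estimated from neighborhood overlaps, which is one plausible instantiation of the estimator rather than the paper's exact one.

```python
# Sketch of the local deviation: mutual-information-based structural dependency,
# top-k structural neighborhood, mean aggregation, and KL divergence. The
# overlap-based joint probabilities are an assumption for illustration.
import numpy as np
import torch

def mutual_information(neigh_i: set, neigh_j: set, num_nodes: int) -> float:
    """MI between 'selected node is a neighbor of v_i' and '... of v_j'."""
    p = np.zeros((2, 2))
    p[1, 1] = len(neigh_i & neigh_j)
    p[1, 0] = len(neigh_i - neigh_j)
    p[0, 1] = len(neigh_j - neigh_i)
    p[0, 0] = num_nodes - len(neigh_i | neigh_j)
    p /= num_nodes
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if p[a, b] > 0:
                mi += p[a, b] * np.log(p[a, b] / (p[a].sum() * p[:, b].sum()))
    return mi

def local_deviation(i: int, neighbors: dict, probs: torch.Tensor, k: int = 2) -> float:
    """KL divergence between node i and its top-k structural neighborhood."""
    num_nodes = probs.size(0)
    scores = [(j, mutual_information(neighbors[i], neighbors[j], num_nodes))
              for j in neighbors[i]]
    top_k = [j for j, _ in sorted(scores, key=lambda s: -s[1])[:k]]
    neigh_dist = probs[top_k].mean(dim=0)     # mean-aggregated class distribution
    eps = 1e-12
    p, q = probs[i] + eps, neigh_dist + eps
    return float((p * (p / q).log()).sum())   # KL(node || structural neighborhood)
```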
4.2.3 Global deviation
Furthermore, we explore whether mislabeled nodes deviate from neighboring nodes from a global view, i.e., we estimate the change of a node's relative placement from the feature space to the output space. Fig.2(b) shows that the changes in the relative relations of mislabeled nodes are much greater and more unstable than those of clean nodes. Specifically, we examine the ranking variance of similar-node lists. First, we obtain a node sequence for each sample node, where the nodes are ordered by their similarity in the feature space. Then, we reorder the sequence by the similarity in the output space of the learned model and obtain a new ordered node list. A smaller ranking score means a greater change of the node's relative relationships in the output space compared with the feature space. Thus, mislabeled nodes exhibit a ranking deviation.
For a target node $v_i$, we compute the distance between nodes by cosine similarity and obtain a node list in descending order of similarity to $v_i$ in the feature space. For simplicity, the length of the node sequence is limited to $l$, and the corresponding similarity sequence $\{s_1, \ldots, s_l\}$ in the feature space is denoted by $S_i$. For the target node $v_i$, the Discounted Cumulative Gain (DCG) of $S_i$ is calculated by
$$\mathrm{DCG}_i = \sum_{j=1}^{l} \frac{s_j}{\log_2(j+1)}.$$
Accordingly, we reorder the nodes in the list by their similarity to $v_i$ in the output space, which yields a reordered sequence $S'_i = \{s'_1, \ldots, s'_l\}$; its DCG is denoted by $\mathrm{DCG}'_i$.
Definition 4.3 For each labeled node $v_i$, the global deviation, denoted by $d^g_i$, is estimated by the relative position change of $v_i$ from the feature space to the output space. It is calculated as the normalized DCG:
$$d^g_i = \frac{\mathrm{DCG}'_i}{\mathrm{DCG}_i}.$$
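A sketch of the global deviation is given below: the $l$ most similar nodes in the feature space provide the relevance values, and the score compares the DCG of the output-space reordering with the feature-space (ideal) DCG. The standard log2 discount is assumed; the paper's exact gain and discount may differ.

```python
# Sketch of the global deviation: an NDCG-style ratio between the output-space
# ordering and the feature-space ordering of the l most similar nodes.
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """Standard DCG with a log2 position discount."""
    positions = np.arange(2, len(relevances) + 2)
    return float(np.sum(relevances / np.log2(positions)))

def global_deviation(x_feat: np.ndarray, z_out: np.ndarray, i: int, l: int = 10) -> float:
    """Ratio of output-space DCG to feature-space DCG for node i."""
    def cosine_to_i(M: np.ndarray) -> np.ndarray:
        v = M[i] / (np.linalg.norm(M[i]) + 1e-12)
        U = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
        sims = U @ v
        sims[i] = -np.inf                       # exclude the node itself
        return sims

    feat_sims = cosine_to_i(x_feat)
    top = np.argsort(-feat_sims)[:l]            # l most similar nodes in feature space
    ideal_dcg = dcg(feat_sims[top])             # ranked by feature-space similarity

    out_sims = cosine_to_i(z_out)
    reorder = top[np.argsort(-out_sims[top])]   # same nodes, reordered in output space
    return dcg(feat_sims[reorder]) / (ideal_dcg + 1e-12)
```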
4.3 Self-adaptive sample network
The aforementioned training deviations can serve as features to distinguish potentially mislabeled nodes in the noisy data. Thus, we introduce a sample network that selects data and realizes adaptive data utilization to guide the training of the GNN. Based on the previous training deviations, the sample network outputs a weight for each node and suggests the data utilization for the next training step.
At each training step $t$, we obtain the corresponding deviation features $\mathbf{d}_i^t = [\,d^p_i, d^l_i, d^g_i\,]$ for each training node $v_i$. We take the sequence of training features within the time window $[t-\tau, t]$ as the input of the sample network, denoted by $\mathbf{D}_i^t$. Furthermore, we want to capture the relative position of each node among the training set on each feature dimension. Inspired by [27], we compute $\Delta\mathbf{D}_i^t$, the difference between the original feature values and their percentiles. To normalize each training feature and account for its relative importance, we calculate its percentile independently based on the data distribution; this helps to equalize the influence of different features and improve the overall performance of the model. The final training feature of node $v_i$ is the concatenation of $\mathbf{D}_i^t$ and $\Delta\mathbf{D}_i^t$.
To aggregate information across the $\tau$ time steps, we adopt a recurrent neural network, such as an LSTM [44]. Meanwhile, we add another feature to indicate the training progress, whose value is the percentage of completed epochs, and use an embedding layer to generate the epoch embedding $\mathbf{e}_t$. Finally, the sample weight is obtained by a fully connected layer followed by a sigmoid layer. Let $g_{\phi}$ denote the sample network with parameters $\phi$:
$$w_i = \mathrm{sigmoid}\Big(\mathbf{W}_f\,\big[\,\mathrm{LSTM}\big([\mathbf{D}_i^t;\Delta\mathbf{D}_i^t]\big)\,;\ \mathbf{e}_t\,\big]\Big),$$
where $w_i$ is the sample weight of $v_i$ and $\mathbf{W}_f$ is the transform matrix of the fully connected layer. Then, the final loss function is
$$\mathcal{L} = \sum_{v_i \in \mathcal{V}_L} w_i\,\ell\big(\hat{\mathbf{y}}_i, \tilde{y}_i\big).$$
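The following PyTorch sketch shows one way to realize the sample network described above: an LSTM over the windowed deviation features, an embedding of the training progress, and a linear-plus-sigmoid head producing the weight, followed by the re-weighted loss. Dimensions, the feature layout, and the normalization constant are assumptions for illustration, not the authors' implementation.

```python
# Compact sketch of the self-adaptive sample network and the re-weighted loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleNetwork(nn.Module):
    def __init__(self, feat_dim: int, hid_dim: int = 64, num_buckets: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.epoch_emb = nn.Embedding(num_buckets, hid_dim)   # training-progress feature
        self.head = nn.Linear(2 * hid_dim, 1)

    def forward(self, feats: torch.Tensor, progress: torch.Tensor) -> torch.Tensor:
        # feats: [num_labeled, window, feat_dim], deviations plus percentile gaps
        # progress: [num_labeled] integer buckets of the training percentage
        _, (h_n, _) = self.lstm(feats)
        h = torch.cat([h_n[-1], self.epoch_emb(progress)], dim=-1)
        return torch.sigmoid(self.head(h)).squeeze(-1)        # sample weights in (0, 1)

def weighted_loss(Z, y_observed, train_idx, weights):
    # Re-weighted cross-entropy over the selected labeled nodes.
    losses = F.cross_entropy(Z[train_idx], y_observed[train_idx], reduction="none")
    return (weights * losses).sum() / (weights.sum() + 1e-12)
```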
The sample network decides the data-utilization strategy. To pre-train it, we could pre-define the strategy based on previous observations, for example de-weighting nodes with higher prediction errors, since such nodes are more likely to be mislabeled. A better way, however, is to learn the strategy from the training data. To this end, we recheck a small set of data to obtain a clean subset, construct a noisy dataset from it to train a GNN model, and collect the corresponding training features. We then set the weight of each correctly labeled node to 1 and of each mislabeled node to 0 to pre-train the sample network. It is worth noting that only a small clean subset is needed, and the sample network pre-trained on one dataset can be transferred and co-trained with GNNs on another dataset.
In practice, the sample network is not introduced at the very beginning of training, to avoid injecting noise from the training statistics of the still unstable GNN model. Meanwhile, we select only part of the data for model training: in the beginning we randomly drop out a fraction of the samples, and after some training steps we keep the samples with higher weights and set the weights of the remaining samples to 0. The procedure of training GNNs with the sample network is shown in Algorithm 1.
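An illustrative training loop following the schedule just described is sketched below: random sub-sampling during warm-up, then selection and re-weighting by the pre-trained sample network. The median-based pruning rule and the `deviation_fn` callable (expected to return the windowed deviation features) are hypothetical placeholders, not the procedure of Algorithm 1 itself.

```python
# Illustrative Soft-GNN training loop; `weighted_loss` and `SampleNetwork` refer
# to the previous sketch, and `deviation_fn` is a user-supplied placeholder.
import torch

def train_soft_gnn(gnn, sample_net, deviation_fn, A_norm, X, y_observed, train_idx,
                   epochs=200, warmup=50, keep_ratio=0.9, lr=0.01):
    opt = torch.optim.Adam(gnn.parameters(), lr=lr)
    for epoch in range(epochs):
        gnn.train()
        Z = gnn(A_norm, X)
        if epoch < warmup:
            # Warm-up: randomly drop a fraction of the labeled nodes.
            weights = (torch.rand(len(train_idx)) < keep_ratio).float()
        else:
            feats = deviation_fn(Z, train_idx)   # [num_labeled, window, feat_dim]
            progress = torch.full((len(train_idx),), int(100 * epoch / epochs),
                                  dtype=torch.long)
            with torch.no_grad():
                weights = sample_net(feats, progress)
            # Keep higher-weight samples, zero out the rest (median split assumed).
            weights = torch.where(weights >= weights.median(), weights,
                                  torch.zeros_like(weights))
        loss = weighted_loss(Z, y_observed, train_idx, weights)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gnn
```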
4.4 Time complexity
In this part, we analyze the time complexity of Soft-GNN. The additional training time comes from collecting the training statistics and computing the sample network. For simplicity, we assume the GNN and the sample network have the same number of hidden units, denoted by $h$. Collecting the training statistics that feed the sample network costs $O(|\mathcal{V}_L|(k+l)h)$, which is linear in the number of labeled nodes since $k$ and $l$ are usually small. For the sample network, the complexity of the LSTM layer and the fully connected layer is $O(|\mathcal{V}_L|\tau h^2)$, where $\tau$ is the length of the input sequence. Thus, the total extra time complexity is $O(|\mathcal{V}_L|(k+l)h + |\mathcal{V}_L|\tau h^2)$. For a 2-layer GCN, the complexity is $O(|\mathcal{E}|h)$, linear in the number of edges $|\mathcal{E}|$. The complexity of Soft-GNN is therefore $O(|\mathcal{E}|h + |\mathcal{V}_L|(k+l)h + |\mathcal{V}_L|\tau h^2)$. Because $|\mathcal{V}_L| \ll |\mathcal{E}|$ and $k$, $l$, and $\tau$ are small constants, the final time complexity is $O(|\mathcal{E}|h)$, the same as that of GCN. In summary, the additional time cost is tolerable compared with the time consumption of the GNN itself. We also present the comparison of average training time in Section 5.4.
5 Evaluations
For evaluations, we compare Soft-GNN with state-of-the-art baselines on five datasets. We also explore the effectiveness of the proposed method under different noise rates.
5.1 Experimental setups
5.1.1 Datasets
We use five benchmark datasets for evaluations: Cora, Citeseer [45], Pubmed [46], Wiki-cs [47], and Amazon computers (A-computers) [48]. The details are presented in Tab.1.
Tab.1 The statistics of five datasets
Dataset | Nodes | Edges | Class | train / val / test |
Cora | 2,810 | 15,926 | 7 | 140 / 500 / 1310 |
Citeseer | 2,110 | 7,336 | 6 | 120 / 500 / 610 |
Pubmed | 19,717 | 88,648 | 3 | 60 / 500 / 1000 |
Wiki-cs | 11,311 | 431,108 | 10 | 500 / 500 / 1000 |
A-Computers | 13,381 | 491,722 | 10 | 500 / 500 / 1000 |
5.1.2 Noise types
Since there is no public noisy dataset for node classification, we generate noisy datasets with a given noise rate. Generally, there are two types of noise: uniform noise and pair noise. Under the former, a label is corrupted to any other label with uniform probability; under the latter, a label flips to one pre-defined similar class with a certain probability. For example, if there are three classes and the corruption probability is 0.2, the class transition matrices $T_{\text{uniform}}$ and $T_{\text{pair}}$ are as follows:
$$T_{\text{uniform}} = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}, \qquad T_{\text{pair}} = \begin{bmatrix} 0.8 & 0.2 & 0 \\ 0 & 0.8 & 0.2 \\ 0.2 & 0 & 0.8 \end{bmatrix}.$$
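A sketch of the two noise models follows. Uniform noise spreads the corruption probability evenly over the other classes; for pair noise, the "similar" class is taken here to be the next class index, which is an arbitrary stand-in for the paper's pre-defined mapping.

```python
# Sketch of generating uniform and pair label noise at a given rate.
import numpy as np

def uniform_noise(y: np.ndarray, num_classes: int, rate: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < rate
    for i in np.where(flip)[0]:
        others = [c for c in range(num_classes) if c != y[i]]
        y_noisy[i] = rng.choice(others)          # uniform over the remaining classes
    return y_noisy

def pair_noise(y: np.ndarray, num_classes: int, rate: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < rate
    y_noisy[flip] = (y[flip] + 1) % num_classes  # flip to a fixed "similar" class
    return y_noisy
```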
5.1.3 Baselines
For comparison, we select several baselines, which fall into two categories: those designed for image data and those designed for graph data. The following four baselines are widely used in image classification to deal with training under noisy labels; in our experiments, we replace their prediction model with the GNN model.
● Decoupling [33]: The framework contains two independent prediction models, which are updated when their predictions disagree. We randomly choose one model to make predictions on the test set.
● Co-teaching [34]: Two prediction models are adopted and one model is updated with the samples screened by the other model. Similarly, we use one model to generate predictions in the test phase.
● Trunc [49]: The proposed loss function is a generalized cross entropy. We use the truncated version from the paper, in which samples whose loss is smaller than the pre-defined threshold are pruned.
● SACE [10]: It utilizes the intermediate output of the model to dynamically correct labels and re-weight samples during training.
The following four state-of-the-art baselines are designed for training GNNs with noisy labels.
● UnionNET [5]: The main idea is to re-weight samples and correct labels by aggregating the labels of a support node set.
● SuLD-GNN [6]: The framework focuses on label correction. It adds superclass nodes to the original graph and trains the GNN to correct labels with high prediction probability.
● NRGNN [8]: It adds edges between labeled and unlabeled nodes via an edge predictor and trains the GNN model together with the edge predictor.
● CLNode [37]: It introduces curriculum learning by sequentially training models on increasingly challenging subsets of the dataset.
In the experiments, we choose two fundamental and representative graph neural network models, GCN [9] and APPNP [36], as the basic models for investigating the performance of the proposed framework. According to [50], GCN and APPNP represent two types of GNN models with different propagation mechanisms: the former repeatedly conducts linear transformation and non-linear activation throughout multiple layers, while the latter utilizes a propagation mechanism derived from personalized PageRank and separates feature transformation from the aggregation process.
5.1.4 Experimental settings
For data splitting, we randomly choose 1500 nodes as the visible set, from which the training and validation sets are selected. For the test set, if the number of remaining nodes is less than the size of the visible set, we use all the remaining nodes; otherwise, we randomly select 1000 nodes. The number of layers is set to 2 and the number of hidden units to 64. The learning rate ranges from 0.1 to 0.01 and the dropout rate is chosen from 0.2 to 0.6. In addition, the ranges of $k$ and $l$ are [1,5] and [10,50], respectively. The epoch at which the sample network is introduced is set between 50 and 100, and the start epoch for other baselines, if needed, is set the same. For the baselines, the parameters reported in the original papers are adopted where available. For reliable results, we use five different data splits and conduct experiments 10 times for each split. All experiments are run on one Tesla P100 GPU.
5.2 Performance comparison
We choose two popular GNNs as the basic models and compare Soft-GNN with the baselines under the two types of noise. The comparison results on the node classification task under noise rate=0.2 are shown in Tab.2. It can be seen from the table that Soft-GNN performs better than the basic GNNs and achieves the highest prediction accuracy in most cases. The accuracy gain is around 2%, and exceeds 3% in some cases.
Tab.2 The Micro-F1 (%) under uniform and pair noise, noise rate=0.2. We present the results and 95% confidence interval
Dataset | Noise type | Backbone | Basic | Co-teaching | Decoupling | Trunc | SACE | UnionNET | SuLD-GNN | NRGNN | CLNode | ours |
Cora | Uniform | GCN | 73.02±0.53 | 61.87±1.05 | 72.66±0.44 | 72.62±0.49 | 69.98±0.43 | 71.32±0.72 | 73.10±0.54 | 74.43±0.52 | 75.49±0.68 | 76.25±0.57 |
APPNP | 77.64±0.77 | 64.24±1.28 | 77.69±0.76 | 74.13±1.04 | 75.02±0.95 | 73.33±1.30 | 78.16±0.77 | − | 77.85±0.71 | 78.70±0.71 |
Pair | GCN | 78.43±0.31 | 71.15±0.51 | 78.19±0.31 | 78.48±0.46 | 78.38±0.36 | 78.20±0.36 | 79.16±0.26 | 78.69±0.59 | 79.07±0.46 | 79.49±0.48 |
APPNP | 81.60±1.01 | 76.29±1.63 | 81.27±0.84 | 81.46±0.82 | 81.91±0.89 | 81.51±0.65 | 81.26±0.69 | − | 82.68±0.42 | 83.66±0.37 |
Citeseer | Uniform | GCN | 65.27±0.72 | 59.03±0.96 | 64.97±0.79 | 64.84±0.80 | 66.08±0.81 | 66.07±1.07 | 65.07±0.60 | 67.22±0.75 | 67.91±0.67 | 68.40±0.70 |
APPNP | 71.05±0.68 | 69.75±0.73 | 70.50±0.77 | 71.00±0.59 | 71.86±0.60 | 70.75±0.71 | 71.00±0.79 | − | 71.64±0.72 | 71.97±0.74 |
Pair | GCN | 65.62±0.63 | 61.31±1.16 | 65.58±0.56 | 67.00±0.75 | 66.25±0.66 | 65.80±1.15 | 65.36±0.67 | 66.56±0.82 | 67.94±0.52 | 68.11±0.56 |
APPNP | 69.46±0.87 | 67.35±0.80 | 69.12±0.80 | 69.95±0.88 | 70.07±0.62 | 69.23±0.65 | 67.44±0.77 | − | 70.78±0.44 | 71.98±0.40 |
Pubmed | Uniform | GCN | 72.36±1.16 | 71.76±1.06 | 72.29±1.11 | 72.30±1.17 | 72.84±1.17 | 72.28±1.12 | 70.48±1.52 | 68.46±1.54 | 72.96±0.99 | 73.08±1.11 |
APPNP | 74.67±2.02 | 73.88±1.75 | 74.37±2.06 | 75.07±2.05 | 74.89±1.97 | 74.35±1.95 | 72.42±2.73 | − | 74.90±1.02 | 75.65±2.31 |
Pair | GCN | 70.21±0.84 | 66.68±1.85 | 70.09±0.86 | 70.81±1.01 | 71.42±1.07 | 69.85±0.87 | 69.22±0.89 | 68.75±0.77 | 70.23±1.08 | 71.02±0.87 |
APPNP | 74.92±1.07 | 74.39±1.36 | 74.16±1.25 | 75.49±1.14 | 75.40±1.09 | 74.63±1.08 | 74.42±1.04 | − | 75.92±0.85 | 76.13±1.02 |
Wiki-cs | Uniform | GCN | 56.99±0.53 | 58.24±0.91 | 56.96±0.51 | 49.84±1.04 | 51.66±0.98 | 48.94±1.73 | 53.87±1.10 | 55.23±1.13 | 56.75±0.67 | 58.04±0.63 |
APPNP | 67.55±0.73 | 67.71±0.66 | 66.33±0.77 | 62.61±1.23 | 64.45±0.86 | 60.42±2.19 | 56.03±2.29 | − | 67.29±0.65 | 68.81±0.91 |
Pair | GCN | 55.36±0.47 | 56.39±0.45 | 55.63±0.49 | 51.34±0.83 | 52.42±0.79 | 49.68±1.37 | 53.53±1.21 | 54.37±0.74 | 55.06±0.49 | 56.87±1.04 |
APPNP | 65.72±0.41 | 66.54±0.51 | 64.56±0.49 | 61.91±0.72 | 62.94±0.62 | 58.74±1.94 | 54.79±1.69 | − | 65.11±0.50 | 66.93±0.70 |
A-computers | Uniform | GCN | 58.01±1.18 | 58.27±3.00 | 57.27±3.34 | 52.76±1.82 | 53.36±2.25 | 48.63±2.18 | 59.01±2.74 | − | 56.95±1.58 | 59.89±1.32 |
APPNP | 64.20±1.73 | 64.83±3.22 | 65.19±3.71 | 56.03±5.00 | 56.88±2.84 | 52.74±3.89 | 65.32±3.90 | − | 64.98±2.23 | 65.44±1.85 |
Pair | GCN | 69.25±1.54 | 68.52±1.79 | 68.92±1.61 | 59.91±2.83 | 60.44±2.32 | 54.01±6.77 | 68.88±1.83 | − | 68.34±1.59 | 70.43±2.21 |
APPNP | 73.98±0.75 | 71.36±1.79 | 73.30±0.88 | 62.28±2.32 | 63.71±1.88 | 63.20±3.36 | 74.51±0.74 | − | 73.80±0.99 | 75.11±0.73 |
Among the methods that train two GNN models, the performance of Co-teaching varies more across datasets than that of Decoupling: Co-teaching achieves stable improvements in prediction accuracy on Wiki-cs and A-computers, while it suffers severe performance degradation on the other three datasets. Meanwhile, some baselines are unsatisfactory on the relatively larger datasets Wiki-cs and A-computers. For example, Trunc and SACE perform worse on Wiki-cs and A-computers than on the remaining three datasets, and the same holds for UnionNET, SuLD-GNN, and CLNode. In contrast, our method continues to deliver considerable performance improvements on these more complex network structures. Wiki-cs and A-computers have more nodes, edges, and classes, and in this case Soft-GNN achieves the best results among all methods, with a performance improvement of approximately 2%. SACE corrects labels with soft labels during training, and UnionNET and SuLD-GNN include label correction modules; as the numbers of classes and training samples increase, such correction is more likely to introduce additional noise, and their performance decreases.
As for NRGNN, we find that it performs better on the relatively small datasets, Cora and Citeseer, but is not as good as the basic GNNs on Pubmed and Wiki-cs. During training, NRGNN changes the graph topology by adding edges between labeled and unlabeled nodes that are likely to be from the same class. It therefore becomes harder to make beneficial topology changes as the numbers of nodes or classes increase. Moreover, NRGNN is more susceptible to out-of-memory errors when trained on datasets with complicated graph structures, such as Amazon computers.
Comparing the two backbones, APPNP is more tolerant to label noise than GCN, with prediction accuracy around 3% to 10% higher. Accordingly, the improvements of Soft-GNN and the baselines on GCN are more pronounced than those on APPNP. Since APPNP decouples prediction from propagation, it also blocks the spread of label noise and reduces its impact during training; therefore, APPNP performs much better than GCN.
In summary, the results verify the effectiveness of Soft-GNN by finding optimal strategies for data utilization during training.
5.3 Performance analysis
5.3.1 The level of noise rate
In this part, we examine the performance of GCN under different noise rates. The results on Cora and Citeseer are presented in Fig.4. Overall, Soft-GNN performs better than the basic GNNs, and the improvements are clearer at higher noise rates. The proposed method outperforms all baselines and even improves the performance of GNNs when there is no label noise, since it realizes flexible data utilization through the sample network and avoids introducing extra noise.
Fig.4 The results under different noise types and rates on Cora and Citeseer. (a) Uniform; (b) Pair; (c) Uniform; (d) Pair
In general, the methods designed for GNNs perform better than those that ignore the dependency between nodes. Among the baselines from the image field, Co-teaching is the worst and falls below the basic GNNs in most cases, while SACE behaves similarly to GCN and performs better under heavier label noise on Citeseer. Among the frameworks designed for GNNs, SuLD-GNN is only minimally effective, and NRGNN performs better than SuLD-GNN in some cases. SuLD-GNN relies on label correction, whose effect depends on the correctness of the modified labels, so it does not always work. Compared with SuLD-GNN, SACE modifies labels with soft labels and can reduce the risk of mistakes in some cases; for example, SACE performs much better than SuLD-GNN under uniform noise on Citeseer in Fig.4(c). Similarly, NRGNN adjusts the topology during training. As Fig.4(a) shows, NRGNN helps under severe label noise but is counterproductive when the noise rate is small or equal to 0, and a similar trend can be seen in Fig.4(c). Therefore, the strategy of topology adjustment works only up to a point and introduces extra noise.
In addition, the two types of noise affect model performance differently even though the overall trends are similar: GCN is more sensitive to uniform noise than to pair noise, while it is relatively more difficult to improve the performance of GNNs under pair noise.
5.3.2 The performance of mislabeled nodes
In this part, we explore how the proposed method works by evaluating the performance on mislabeled nodes and their neighborhoods. We test Soft-GNN based on GCN under noise rate=0.4. Fig.5 illustrates the performance gains of Soft-GNN over GCN in terms of Micro-F1 and Macro-F1, evaluated against the true labels. The neighborhood of a mislabeled node consists of its top-5 structural neighboring nodes. The figure shows that Soft-GNN improves the overall prediction accuracy of GCN and also increases the performance on the mislabeled nodes and their neighborhoods. This indicates that the model trained with Soft-GNN is more robust to label noise and at the same time mitigates the propagation of noise to the corresponding neighborhoods. We also notice that the improvements on mislabeled nodes are smaller than those on their neighborhoods, which is more obvious under pair noise. A closer look at Fig.5(a) and Fig.5(b) shows that the increase in Macro-F1 is much higher than that in Micro-F1 for both mislabeled nodes and their neighborhoods. This confirms that the proposed method protects mislabeled nodes and their neighbors from the confusion caused by label noise and learns a more robust GNN model.
Fig.5 The performance gains over GCN on Citeseer under noise rate=0.4. (a) Micro-F1; (b) Macro-F1
5.4 Time and memory comparison
In this section, we compare the training time and memory efficiency of the baselines and our method. We measure the average running time for 200 epochs and the corresponding GPU memory usage. Tab.3 reports the ratios to the basic GCN with 64 hidden units. According to these ratios, the training time of Soft-GNN increases while its memory usage is almost the same as that of GCN. The time increase comes from the weight-generation step and is positively related to the number of training samples; for example, the training time on Wiki-cs is larger than on Citeseer and Pubmed.
Tab.3 The comparison of time and memory efficiency over GCN with 64 hidden units (values are ratios to GCN)
Method | Time (Citeseer) | Time (Pubmed) | Time (Wiki-cs) | GPU memory (Citeseer) | GPU memory (Pubmed) | GPU memory (Wiki-cs) |
Co-teaching | 1.38 | 1.35 | 1.31 | 1.03 | 1.05 | 1.02 |
Trunc | 0.50 | 0.44 | 0.52 | – | – | – |
UnionNET | 0.72 | 0.71 | 0.73 | – | – | – |
NRGNN | 16.18 | 70.79 | 61.09 | 1.93 | 1.91 | 9.35 |
Soft-GNN | 2.00 | 1.85 | 3.26 | 1.04 | 1.02 | – |
The other baselines, except for NRGNN, take roughly the same time and memory as GCN during training. Trunc and UnionNET even spend less time than GCN because part of the samples are truncated, which reduces the cost of backward propagation. NRGNN, by contrast, needs much more time for training, for example about 16 times as much on Citeseer, and the additional time consumption grows with the numbers of nodes and edges; training NRGNN therefore takes a significant amount of time on Pubmed and Wiki-cs, which have more nodes and edges. As for the memory usage of NRGNN, the extra usage mainly comes from the topology adjustment and depends on the number of edges. Adjusting the graph structure is thus heavy in both time and memory. In summary, Soft-GNN achieves a steady performance improvement with only a limited increase in time and memory consumption.
5.5 Hyper-parameter analysis
We explore the effect of two hyper-parameters: the number of structural neighboring nodes $k$ and the length of the node similarity sequence $l$. The range of $k$ is [1,5], and the value of $l$ is chosen from [5,10] with a step of 1 and from [10,50] with a step of 10. As shown in Fig.6, we test the prediction accuracy of Soft-GNN on Cora with uniform noise and a noise rate of 0.3. As $k$ increases, the performance drops in general, because more nodes are added to the structural neighborhood and the newly selected nodes are more likely to come from other classes. We measure the ratio of same-class node pairs between a node and its structural neighborhood, denoted as pair homophily; generally, a larger $k$ leads to lower pair homophily, so $k$ is set to a relatively small value. It is also found that a too small or too large $l$ is not a good choice, and when $k$ is smaller it is better to set $l$ below 20. Appropriate combinations take $k$ from {1,2,3} and $l$ from {5,20}. For the evaluations, we set $k=2$ and $l=10$, except for $l=20$ on the Wiki-cs dataset.
Fig.6 The performance comparison with different hyper-parameters $k$ and $l$ on Cora under noise rate=0.3 with GCN
6 Conclusion
In this paper, we analyze the impact of mislabeled nodes when training GNNs under label noise. We find that label noise causes prediction deviation and disrupts the original relationships with neighboring nodes. Based on these observations, we propose a simple yet effective framework for training robust GNNs, denoted Soft-GNN. The introduced sample network utilizes the observed training deviations and outputs sample weights to realize self-adaptive data utilization. The experimental results show that Soft-GNN learns robust GNN models under different noise types and rates, and it still brings improvements when the data is noise-free. In addition, Soft-GNN improves the performance on mislabeled nodes and their neighboring nodes. We focus on label noise as node-level data corruption; future work will pay close attention to other types of data noise.