Soft-GNN: towards robust graph neural networks via self-adaptive data utilization

Yao WU, Hong HUANG, Yu SONG, Hai JIN

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (4) : 194311. DOI: 10.1007/s11704-024-3575-5
Excellent Young Computer Scientists Forum
RESEARCH ARTICLE


Abstract

Graph neural networks (GNNs) have gained traction and have been applied to various graph-based data analysis tasks due to their high performance. However, a major concern is their robustness, particularly when faced with graph data that has been deliberately or accidentally polluted with noise. This presents a challenge in learning robust GNNs under noisy conditions. To address this issue, we propose a novel framework called Soft-GNN, which mitigates the influence of label noise by adapting the data utilized in training. Our approach employs a dynamic data utilization strategy that estimates adaptive weights based on prediction deviation, local deviation, and global deviation. By better utilizing significant training samples and reducing the impact of label noise through dynamic data selection, GNNs are trained to be more robust. We evaluate the performance, robustness, generality, and complexity of our model on five real-world datasets, and our experimental results demonstrate the superiority of our approach over existing methods.


Keywords

graph neural networks / node classification / label noise / robustness

Cite this article

Yao WU, Hong HUANG, Yu SONG, Hai JIN. Soft-GNN: towards robust graph neural networks via self-adaptive data utilization. Front. Comput. Sci., 2025, 19(4): 194311 https://doi.org/10.1007/s11704-024-3575-5

1 Introduction

Due to their effectiveness, graph neural networks (GNNs) have become the first choice for graph-based problems. In practice, we often face agnostic pollution of graph data, such as deliberate adversarial attacks or accidental annotation mistakes [1]. For example, crowd-sourcing platforms are often used to annotate data, and it is inevitable that annotators make mistakes and introduce label noise [2]. In this case, GNNs are confronted with the problem of over-fitting to noisy graph data and suffer severe performance degradation [3,4]. In this paper, we focus on mitigating the negative influence of potential label noise in graph data.
To train robust graph neural networks in the presence of label noise, researchers have developed strategies such as sample re-weighting and label correction. Meanwhile, given the interdependence of samples (nodes) in graph data, it is important to consider the effects imposed by the underlying topology. To this end, recent works such as [5] and [6] incorporate neighborhood information during the label correction and sample re-weighting processes. Other approaches aim to protect representations of nodes from the same class against label noise: for example, [7] constructs positive node pairs, while [8] adjusts the graph topology by adding edges between labeled and unlabeled nodes. However, these methods may introduce additional noise by revising the already noisy graph data, which can lead to a performance drop and even more incorrect predictions.
A feasible alternative is to select the right data for model training instead of revising noisy graph data and introducing extra noise. The aforementioned methods still train GNNs with all the observed labels and try not to miss the correct part of the dataset. Let us think about it another way: to learn a better model, is it equally effective to utilize the right subset of labeled nodes containing as little noise as possible? To answer this, we estimate the impact of incorrect labels when training GCN [9] for the node classification task. As can be seen from Fig.1(a), the prediction accuracy on the test set continues to drop as the noise rate increases, whereas if we train GCN on the subset of the training set obtained by masking all incorrect labels, there is only a slight degradation in performance.
Fig.1 (a) The performance of GCN under different levels of noise rate on Cora and Citeseer. The “-NF” dataset is derived from the noisy dataset by eliminating the noise labels; (b) the loss curve of GCN trained on Cora under noise rate=0.3


To this end, we look deeper into the training of GNNs on noisy graph data and find differences between mislabeled nodes and correctly labeled nodes during training. As observed in Fig.1(b), the average loss on mislabeled nodes is larger than that on correct nodes: mislabeled nodes cause the prediction to deviate from that of correct nodes, and the deviation tends to increase as training proceeds. A similar phenomenon has been discovered in the image field [10]. Compared to image data, nodes in graph data are connected, and graph neural networks update node representations based on the graph structure. We therefore revisit the training differences between mislabeled and correctly labeled nodes from the perspective of node neighborhoods. Specifically, for nodes in the graph, we analyze both local and global structural views. 1) Local view. We measure the local relative relationship by estimating the distribution distance between the representations of a node and its surrounding neighbors. 2) Global view. We estimate the global relative relationship by measuring the change in the relative position ranking of a node compared to similar nodes in the original feature space. Fig.2(a) and Fig.2(b) show the changes in node-related KL-divergence and NDCG during training, respectively. There are obvious deviations between mislabeled and correctly labeled nodes from both local and global views. For consistency, we further refer to these two kinds of deviations as local deviation and global deviation, respectively.
Fig.2 (a) The KL-divergence of the prediction distributions between nodes and local surrounding nodes; (b) the NDCG@10 of ranking similar nodes based on feature space in the output space. We get the results on Cora with GCN when the noise rate is 0.3


Based on the above analysis, we discover three types of deviations between mislabeled nodes and correct nodes. These training statistics uncover the potential impact of label noise from the prediction view as well as the local and global neighborhood views, and can be further utilized for data selection.
Thus, we propose a novel framework to reduce the influence of label noise through self-adaptive data utilization. During training, we introduce a self-adaptive sample network and select a subset of labeled nodes based on the current training state. Specifically, the sample network learns sample weights by considering prediction, local, and global deviations and prunes labeled nodes according to the output weights. We further balance the selected labeled nodes with the corresponding weights and assign the learned weights to the selected nodes through loss correction for model training. Remarkably, the sample network realizes dynamic data selection and avoids the accumulation of selection errors. Moreover, it enables a flexible training mechanism: for example, more low-weight nodes can be selected as training proceeds, which effectively prevents the model from over-fitting.
In summary, the contributions are three folds:
● We analyze the training states of GNNs trained with noisy data and find that mislabeled nodes show noticeable deviations in their training states.
● We propose a novel framework to introduce a self-adaptive sample network for dynamic data selection and mitigate the negative impact of label noise, denoted by Soft-GNN.
● We conduct extensive experiments to verify the effectiveness of the proposed framework and provide new insights into the improvements.
The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 introduces the preliminary and addresses the challenge. Next, we provide the solution and illustrate the proposed framework in Section 4. Then, we present the details of the experiments and analyze the experimental results in Section 5. In the end, we conclude and discuss the future work in Section 6.

2 Related works

2.1 Label noise

Label noise is a common problem in classification tasks and may lead to serious consequences, such as performance degradation or model corruption [11]. Incorrect labels arise from inaccurate manual or machine annotation, for example in medical data [12,13], image data [14,15], and language corpora [16]. Learning with noisy labels has attracted increasing research interest. Some studies focus on noise-robust models, such as SVMs [17] and decision trees [18]. Other studies pay more attention to learning algorithms under noisy labels [19]. The typical learning paradigm is to distinguish noisy labels and reduce the effect of label noise during training. How to learn robust models under label noise has been widely studied in the image field, where the methods fall into two categories: data-based and model-based. The former filters noisy data from the perspective of data generation, using, for example, Bayesian approaches [20] or cluster-based methods [21]. These methods first identify possibly noisy data and then learn the model on the filtered clean dataset [22].
The model-based approaches recognize potentially incorrect labels according to model predictions [23-25] and can be applied to training deep neural networks under label noise in an end-to-end manner. There are three main categories of training strategies: sample re-weighting [26-30], label correction [10,31,32], and cooperative training [33-35]. Sample re-weighting treats noisy and correct samples differently and reassigns sample weights: [26] adopts a Bayesian approach for re-weighting, while [27,28] introduce a neural network to learn the weight for each sample. In addition, [29,30] discard potentially noisy samples during training. Label correction revises the given labels: [10] proposes a method that updates each label with the prediction as a soft label and reassigns the sample weight at the same time, and [31] introduces a meta-model as the label correction network. Cooperative training trains two models that supervise each other: Decoupling [33] updates the models when they make different predictions, and co-teaching [34] updates one model with the small-loss samples screened by the other model.

2.2 GNNs with label noise

Recently, graph neural networks such as GCN [9] and APPNP [36] have achieved promising results on graph-based tasks. Generally, the given labels are treated as true labels when training GNNs, and little attention has been paid to training GNNs with label noise. For graph-level classification, some strategies from the image field can be used directly; for example, [3] adopts loss correction by estimating the correction matrix. For node-level classification, the nodes in graph data are connected and dependent. Meanwhile, GNNs perform feature propagation and aggregation, so the influence of label noise can spread to other samples through edges. Because of this, researchers have designed dedicated strategies for training GNNs under label noise [5-8,37,38]. UnionNET [5] constructs a support node set for each training node and proposes a unified framework with sample re-weighting and label correction. [6] corrects labels by pre-training the model and utilizes the revised labels to train the GNN model. PI-GNN [7] constructs node pairs with positive pairwise interaction among all nodes and protects the embeddings of node pairs against label noise. NRGNN [8] introduces an edge predictor and adds edges between labeled and unlabeled nodes to reduce the effect of label noise. CLNode [37] explores curriculum learning for node classification, improving performance by sequentially training models on increasingly challenging subsets of the dataset. MTS-GNN [38] introduces a multi-teacher self-training strategy for semi-supervised node classification with noisy labels, enhancing classification performance by utilizing knowledge from multiple teachers and noisy labels.
In this paper, we focus on node-level classification under label noise. We discover several deviations between mislabeled and clean nodes, including prediction deviation, local deviation, and global deviation. Based on these observations, we introduce a sample network that uses the above training statistics to realize self-adaptive data utilization during training.
Compared with the state-of-the-art baselines, the proposed method corrects neither the labels nor the graph topology. Label correction has a margin of error, and the error may accumulate as training continues. Likewise, adjusting the graph topology takes much time and its effect is uncertain. By contrast, the proposed framework is flexible in sample selection and can tailor an appropriate learning strategy to the current training state. It avoids error accumulation promptly and achieves a steady performance improvement with little extra time overhead.

3 Preliminary and challenge

3.1 Preliminary

Given a graph $G=(V,E,A)$, $V$ is the set of nodes, $E$ is the edge set, and $A$ represents the adjacency matrix. For a node pair $(u,v)$, $A_{u,v}=1$ if $(u,v)\in E$, and 0 otherwise. Graph neural networks are designed to solve graph-based analysis tasks and achieve promising results. For the node classification task, the graph is usually partially labeled and the labels might be noisy. We denote the GNN model as $f$ with parameters $\theta_f$, and the output of the GNN is derived as:
$$ Z = f(A, X; \theta_f), $$
where $X$ is the feature matrix and $Z$ is the output of the final GNN layer; the prediction is $\hat{Y} = \mathrm{Softmax}(Z)$. Since only part of the nodes are labeled, the GNN model is trained in a semi-supervised way. Thus, the objective is:
$$ \arg\max_{\theta_f} \sum_{i=1}^{|V_{tr}|} \log P(\bar{y}_i \mid x_i; \theta_f), $$
where $V_{tr}$ is the set of training nodes and $\bar{y}_i \in Y_{observed}$ is the observed label.
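For concreteness, the following is a minimal PyTorch sketch of a two-layer GCN producing the output $Z$ above, trained with the semi-supervised cross-entropy objective over the labeled nodes. The dense normalized adjacency and the module layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class TwoLayerGCN(torch.nn.Module):
    """Minimal two-layer GCN: Z = A_hat @ relu(A_hat @ X @ W0) @ W1."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.W0 = torch.nn.Linear(in_dim, hid_dim, bias=False)
        self.W1 = torch.nn.Linear(hid_dim, num_classes, bias=False)

    def forward(self, A_hat, X):
        # A_hat: normalized adjacency with self-loops (dense here for brevity)
        H = F.relu(A_hat @ self.W0(X))
        return A_hat @ self.W1(H)  # Z, the output of the final layer (pre-softmax)

def semi_supervised_loss(Z, labels, train_mask):
    # Cross entropy only over the labeled training nodes (the softmax is folded into cross_entropy)
    return F.cross_entropy(Z[train_mask], labels[train_mask])
```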

3.2 Challenge

In this paper, our goal is to eliminate the negative influence of noisy labels during the training process of GNNs. We consider the situation that the observed labels may be mislabeled, which means $Y_{observed} \neq Y_{true}$, where $Y_{true}$ is the ground truth. To better illustrate the challenge but still capture the essence of graph convolutions, we follow [4] and perform a linearization of GNNs. Taking a two-layer GNN as an example, we replace the non-linearity function $\sigma(\cdot)$ with a simple linear function, that is:
$$ Y = \mathrm{softmax}\big(A\,\sigma(A X W_0)\,W_1\big) \;\Rightarrow\; Y = \mathrm{softmax}(A^2 X W). $$
Since $W_0$ and $W_1$ are free learnable weight matrices, they can be absorbed into a single weight matrix $W$. Typically, $W$ is learned through back-propagation using the observed labels in a semi-supervised manner. However, because the observed labels are noisy, which can be viewed as perturbations of the real labels, the distributions of the observed and real labels may differ, i.e., it holds that:
$$ Y_{true} \neq Y_{observed} \;\Rightarrow\; W_{true} \neq W_{observed}. $$
Equation (4) tells us that a shift of the label distribution causes a synchronous shift of the model parameters; we call this phenomenon distribution shift. Given this, we aim to use variable balancing [39] to jointly learn a global sample weight matrix $P$ while optimizing $W_{observed}$. Following [39], we consider the stability of predictions made by the network when trained on noisy graph data. However, instead of working directly with the stability of predictions, we focus on the stability of the graph representation learned by the network. This is where the sample weight matrix $P$ comes into play. Thus, we have the following formulation:
$$ P\,Y_{observed} = \mathrm{softmax}(A^2 X W_{observed} P), \quad \text{s.t.}\ P \succ 0. $$
According to Eq. (5), $P$ is a positive definite matrix. If $P\,Y_{observed} \approx Y_{true}$, then $W_{observed}P \approx W_{true}$. In other words, the sample weights $P$ guarantee that the model is robust to noisy labels, because they make it possible for the GNN to prevent distribution shift. Note that the label noise is agnostic; in practice, $Y_{true}$ is not accessible. Moreover, the noise distribution depends on various factors, and directly learning it is unrealistic. Hence, the biggest challenge in learning a robust GNN is to estimate the sample weight matrix $P$.
It is more likely to learn a better model when the label distribution is closer to the true distribution. According to Fig.1(a), directly masking the incorrect labels during training yields better performance; then there is no need to learn a $P$ that makes $P\,Y_{observed}$ equal to $Y_{true}$. Studies [40,41] suggest that a model learned on a subset of important samples performs almost as well as one learned on the full dataset. [41] proposes a score function to identify important samples and finds that a pruned dataset is also able to reduce the effect of label noise. Thus, selective data utilization offers a way to mitigate the data distribution shift caused by label noise while preserving the performance of the GNN model.

4 Method

4.1 Overview

We propose a novel framework, named Soft-GNN, which introduces a sample network to select data and mitigate the influence of label noise through self-adaptive data utilization. To this end, we analyze the training states of mislabeled nodes and discover that mislabeled nodes deviate from other labeled nodes at three levels: prediction, local, and global deviation. Based on that, we estimate sample weights from the training deviation features for data selection and re-weight the selected samples for model learning. The architecture of Soft-GNN is illustrated in Fig.3.
Fig.3 The overall framework of Soft-GNN


4.2 The deviation by label noise

4.2.1 Prediction deviation

Previous studies have found that mislabeled samples are difficult for deep neural networks to learn [10]. Fig.1(b) shows a similar phenomenon for GNNs: the loss on clean nodes falls more quickly than that on mislabeled nodes, and the gap becomes more pronounced as training proceeds. Thus, mislabeled nodes are more likely to be hard samples when training GNNs.
Definition 4.1 For each labeled node $v_i$, the prediction deviation is defined as the cross entropy between the model prediction and the label, represented by $PE(v_i)$. It is computed as follows:
$$ PE(v_i) = -\sum_{c=1}^{C} y_{i,c}\log(\hat{y}_{i,c}), $$
where $C$ is the number of classes and $\hat{y}_{i,c}$ is the predicted probability of $v_i$ belonging to class $c$.
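A minimal sketch of computing the prediction deviation per labeled node might look as follows (PyTorch, assuming `Z` holds the raw GNN outputs and `labels` the observed classes):

```python
import torch.nn.functional as F

def prediction_deviation(Z, labels, train_idx):
    """PE(v_i): per-node cross entropy between the prediction and the observed label."""
    log_probs = F.log_softmax(Z[train_idx], dim=-1)                     # log(y_hat_i)
    return F.nll_loss(log_probs, labels[train_idx], reduction="none")   # one value per training node
```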
For graph data, nodes are connected and dependent. We further explore whether mislabeled nodes disrupt the inherent relationships with neighboring nodes and cause deviations in the output space. Theoretically, GNNs follow the paradigm of feature propagation and aggregation, which leads to feature smoothing; that is, a node can be reconstructed from its neighboring nodes. It has also been shown, from the view of label propagation, that feature smoothing theoretically guarantees label smoothing [42]. In other words, the reconstruction relations in the feature space and in the output space generated by the learned model should be consistent. Label noise, however, changes the relations with neighboring nodes in the output space. We find that mislabeled nodes exhibit both local deviation and global deviation.

4.2.2 Local deviation

From the perspective of the graph topology, Fig.2(a) shows that mislabeled nodes are far away from their surroundings in the output space.
To estimate the prediction divergence between a node and its surrounding nodes, the first step is to generate the neighborhood embedding. If we considered all surrounding nodes, the divergence might be caused not by label noise but by class variance among the surrounding nodes. We therefore select the nodes with high structural dependency as the structural neighborhood to reduce the influence of class variance: the higher the structural dependency between two nodes, the more surrounding nodes they have in common. We use mutual information to estimate the structural dependency [43]. Let $S$ denote a set of variables for the nodes in the graph, where $S_i$ represents randomly selecting one node from the graph for node $v_i$. Let $N(v_i)$ denote the neighborhood of node $v_i$ and $\overline{N(v_i)}$ its complement. Then $S_i$ has two possible values $\{0,1\}$, where 0 indicates the selected node is from $\overline{N(v_i)}$ and 1 indicates the selected node belongs to $N(v_i)$. Thus, $p(S_i=0)=\frac{|\overline{N(v_i)}|}{|V|}$ and $p(S_i=1)=\frac{|N(v_i)|}{|V|}$.
Hence, the structural dependency between vi and vj is estimated by the mutual information between Si and Sj, computed by
$$ SD(S_i,S_j) = \sum_{S_i}\sum_{S_j} p(S_i,S_j)\log\frac{p(S_i,S_j)}{p(S_i)\,p(S_j)}, $$
where $p(S_i,S_j)$ is the joint distribution. Since both $S_i$ and $S_j$ have two possible values, $(S_i,S_j)$ has four combinations, with the following probabilities:
$$ \begin{aligned} p(S_i=0,S_j=0)&=\frac{|\overline{N(v_i)}\cap\overline{N(v_j)}|}{|V|}, & p(S_i=0,S_j=1)&=\frac{|\overline{N(v_i)}\cap N(v_j)|}{|V|},\\ p(S_i=1,S_j=0)&=\frac{|N(v_i)\cap\overline{N(v_j)}|}{|V|}, & p(S_i=1,S_j=1)&=\frac{|N(v_i)\cap N(v_j)|}{|V|}. \end{aligned} $$
For a training node $v_i$, we compute the structural dependency between $v_i$ and each neighboring node $v_j \in N(v_i)$. The top-$k$ nodes by structural dependency are selected as the structural neighborhood, denoted by $N_s^k(v_i)$. We then generate the neighborhood embedding by aggregating the representations of the nodes in $N_s^k(v_i)$; for example, using a mean aggregator, the class distribution of the neighborhood is
$$ m_i = \mathrm{Softmax}\Big(\frac{1}{|N_s^k(v_i)|}\sum_{v_j\in N_s^k(v_i)} z_j\Big). $$
Definition 4.2 For each labeled node $v_i$, the local deviation is defined as the node-to-neighborhood divergence, represented by $ND(v_i)$, and estimated by the KL-divergence between the class distributions of the node and its structural neighborhood. It is computed as follows:
$$ ND(v_i) = KL(\hat{y}_i\,\|\,m_i) = \sum_{j=1}^{C}\hat{y}_{i,j}\log\frac{\hat{y}_{i,j}}{m_{i,j}}. $$
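Following the definitions above, a possible NumPy sketch of the structural dependency and the local deviation is given below; the neighborhood sets, the representation matrix `Z`, and the softmax prediction `y_hat_i` are assumed to be precomputed, and this is an illustration rather than the authors' code.

```python
import numpy as np

def structural_dependency(Ni, Nj, num_nodes):
    """Mutual information between the indicator variables S_i and S_j."""
    Ni, Nj = set(Ni), set(Nj)
    n = num_nodes
    # counts for the four (S_i, S_j) combinations
    n11 = len(Ni & Nj)          # in both neighborhoods
    n10 = len(Ni - Nj)          # only in N(v_i)
    n01 = len(Nj - Ni)          # only in N(v_j)
    n00 = n - len(Ni | Nj)      # in neither neighborhood
    joint = np.array([[n00, n01], [n10, n11]], dtype=float) / n
    p_i = np.array([n - len(Ni), len(Ni)], dtype=float) / n
    p_j = np.array([n - len(Nj), len(Nj)], dtype=float) / n
    sd = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if joint[a, b] > 0:
                sd += joint[a, b] * np.log(joint[a, b] / (p_i[a] * p_j[b]))
    return sd

def local_deviation(y_hat_i, Z, topk_idx):
    """ND(v_i): KL divergence between the node's predicted class distribution and
    the softmax of the mean representation of its top-k structural neighborhood."""
    z_bar = Z[topk_idx].mean(axis=0)
    m = np.exp(z_bar - z_bar.max())
    m = m / m.sum()                                  # m_i = Softmax(mean of z_j)
    eps = 1e-12
    return float(np.sum(y_hat_i * np.log((y_hat_i + eps) / (m + eps))))
```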

4.2.3 Global deviation

Furthermore, we explore whether mislabeled nodes deviate from neighboring nodes from a global view by estimating the change of a node's relative placement from the feature space to the output space. Fig.2(b) shows that the changes in the relative relations of mislabeled nodes are much greater and more unstable than those of clean nodes. Specifically, we look at the ranking variance of similar nodes. First, we obtain a node sequence for each sample node, ordered by similarity in the feature space. Then, we re-rank the sequence by similarity in the output space of the learned model and obtain a new ordered node list. A smaller value means a greater change of the node's relative relationships in the output space compared to the feature space. Thus, there is a ranking deviation for mislabeled nodes.
For a target node $v_i$, we compute the similarity between nodes by cosine similarity and obtain a node list $R_i^F$ in descending order of similarity to $v_i$ in the feature space. For simplicity, the length of the node sequence is limited to $q$, and the similarity sequence $\{d_{i,1}^F, \ldots, d_{i,q}^F\}$ in the feature space is denoted by $D_i^F$. For target node $v_i$, the Discounted Cumulative Gain (DCG) of $R_i^F$ is calculated by
$$ DCG^F(v_i) = \sum_{j=1}^{q}\frac{d_{i,j}^F}{\log_2(j+1)}. $$
Accordingly, we obtain a re-ordered node list $R_i^P$ for the nodes in $R_i^F$ and a new similarity sequence $D_i^P = \{d_{i,1}^P, \ldots, d_{i,q}^P\}$ in the output space. Similarly, we compute the DCG of $R_i^P$, denoted by $DCG^P(v_i)$.
Definition 4.3 For each labeled node $v_i$, the global deviation is estimated by the relative position change of node $v_i$ from the feature space to the output space, denoted by $FP(v_i)$. It is computed as follows:
$$ FP(v_i) = \frac{DCG^P(v_i)}{DCG^F(v_i)}. $$
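The global deviation can be sketched as below; we read the definition as an NDCG-style ratio, and the choice of using the re-ranked output-space similarities as the gains in $DCG^P$ is our interpretation of the text, not a confirmed detail.

```python
import numpy as np

def dcg(similarities):
    """Discounted cumulative gain of a similarity sequence d_{i,1..q}."""
    ranks = np.arange(1, len(similarities) + 1)
    return float(np.sum(similarities / np.log2(ranks + 1)))

def global_deviation(x_i, X, z_i, Z, q=10):
    """FP(v_i): DCG of the feature-space top-q list re-scored and re-ranked in the
    output space, divided by its DCG in the feature space."""
    def cosine(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)
    sim_f = cosine(x_i, X)
    top_q = np.argsort(-sim_f)[1:q + 1]        # most similar nodes in feature space (skip the node itself)
    d_f = sim_f[top_q]                         # D_i^F, already in descending order
    sim_p = cosine(z_i, Z)
    d_p = np.sort(sim_p[top_q])[::-1]          # D_i^P: the same nodes re-ranked by output-space similarity
    return dcg(d_p) / (dcg(d_f) + 1e-12)
```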

4.3 Self-adaptive sample network

The aforementioned training deviations can be used as features to distinguish potentially mislabeled nodes in noisy data. Thus, we introduce a sample network to select data and realize adaptive data utilization to guide the training procedure of GNNs. Based on the observed training deviations, the sample network outputs a weight for each node and suggests the data utilization for the next training step.
At each training step $t$, we obtain the deviation features $b_i^{(t)} = [PE(v_i), ND(v_i), FP(v_i)]$ for each $v_i \in V_{tr}$. We take the sequence of training features in the time window $[t-T, t]$ as the input of the sample network, denoted by $b_i = \{b_i^{t-T}, \ldots, b_i^{t}\}$. Furthermore, we compute the relative position of each node among the training set on each feature dimension. Inspired by [27], we compute $\mathrm{diff}(b_i)$, where $\mathrm{diff}(\cdot)$ is the difference between the original value and its percentile. To normalize each training feature and account for its relative importance, we calculate its percentile independently based on the data distribution; this equalizes the influence of different features and improves the overall performance of the model. The training features for node $v_i$ are then the concatenation of $b_i$ and $\mathrm{diff}(b_i)$.
To aggregate information across the time window of length $T$, we adopt a recurrent neural network, e.g., an LSTM [44]. Meanwhile, we add another feature, epoch, to indicate the training progress, expressed as a percentage, and use an embedding layer to generate the epoch embedding $e$. Finally, the sample weight is obtained by a fully connected layer followed by a sigmoid layer. Let $g$ denote the sample network with parameters $\theta_g$:
$$ w_i = g(b_i, e; \theta_g) = \mathrm{Sigmoid}\big(W_s\,[\,\mathrm{lstm}(b_i; \mathrm{diff}(b_i)),\, e\,]\big), $$
where $w_i$ is the sample weight for $v_i$ and $W_s$ is the transformation matrix of the fully connected layer. The final loss function is
$$ L = -\frac{1}{|V_{tr}|}\sum_i w_i \sum_j y_{i,j}\log(\hat{y}_{i,j}). $$
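Below is a minimal PyTorch sketch of the sample network $g$ and the weighted loss; the feature dimension (three deviations plus their percentile differences), the epoch-embedding binning, and the layer sizes are assumptions chosen for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleNetwork(nn.Module):
    """Sketch of the sample network g(b_i, e; theta_g): an LSTM over the windowed
    deviation features, concatenated with an epoch embedding, then FC + sigmoid."""
    def __init__(self, feat_dim=6, hidden_dim=64, num_epoch_bins=100, epoch_dim=16):
        super().__init__()
        # feat_dim = 3 deviation features + their percentile differences diff(b_i)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.epoch_emb = nn.Embedding(num_epoch_bins, epoch_dim)
        self.fc = nn.Linear(hidden_dim + epoch_dim, 1)

    def forward(self, b_seq, epoch_pct):
        # b_seq: [num_train_nodes, T, feat_dim]; epoch_pct: training progress in [0, 1)
        _, (h, _) = self.lstm(b_seq)                       # h[-1]: [num_train_nodes, hidden_dim]
        idx = min(int(epoch_pct * self.epoch_emb.num_embeddings),
                  self.epoch_emb.num_embeddings - 1)
        e = self.epoch_emb(torch.tensor(idx, device=b_seq.device))
        e = e.expand(b_seq.size(0), -1)
        w = torch.sigmoid(self.fc(torch.cat([h[-1], e], dim=-1)))
        return w.squeeze(-1)                               # one weight per training node

def weighted_loss(Z, labels, train_idx, weights):
    """The re-weighted cross-entropy loss over the training nodes."""
    ce = F.cross_entropy(Z[train_idx], labels[train_idx], reduction="none")
    return (weights * ce).mean()
```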
The sample network decides the data utilization strategy. To pre-train the sample network, we could pre-define the strategy based on previous observations, such as down-weighting the nodes with higher prediction errors, since these nodes are more likely to be mislabeled. Instead of pre-defining the strategy, a better way is to learn it from the training data. In this case, we re-check a small set of data to obtain a clean subset. Then, we construct a noisy dataset from it, train a GNN model, and collect the corresponding training features. We set the weight of correct nodes to 1 and mislabeled nodes to 0 to pre-train the sample network. It is worth noting that only a small clean subset is needed, and the sample network pre-trained on one dataset can be transferred to co-train with GNNs on another dataset.
In practice, the sample network is not introduced at the beginning of training, to avoid introducing more noise from the training statistics of the still unstable GNN model. Meanwhile, we select only part of the data for model training: at the beginning, we randomly drop out part of the samples; after some training steps, we choose the samples with higher weights and set the weights of the remaining samples to 0. The procedure of training GNNs with the sample network is summarized in Algorithm 1; a minimal sketch follows below.
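A hedged sketch of the overall training loop could look like the following; `collect_deviation_features` is a hypothetical helper that stacks PE, ND, and FP (and their percentile differences) for the training nodes, `weighted_loss` is the helper from the sketch above, and the warm-up length and keep ratio are illustrative values.

```python
import torch
import torch.nn.functional as F

def train_soft_gnn(gnn, sampler, A_hat, X, labels, train_idx,
                   num_epochs=200, warmup=50, keep_ratio=0.9, window=5):
    """Hypothetical outer loop: warm up first, then let the pre-trained sample
    network select and re-weight training nodes at every step."""
    opt = torch.optim.Adam(gnn.parameters(), lr=0.01, weight_decay=5e-4)
    history = []                                                   # deviation features per step
    for epoch in range(num_epochs):
        Z = gnn(A_hat, X)
        # hypothetical helper stacking [PE, ND, FP] and their percentile diffs per training node
        history.append(collect_deviation_features(Z, labels, train_idx))
        if epoch < warmup:
            # warm-up: plain cross entropy (or a random subsample) over all labeled nodes
            loss = F.cross_entropy(Z[train_idx], labels[train_idx])
        else:
            b_seq = torch.stack(history[-window:], dim=1)          # [N_tr, T, feat_dim]
            with torch.no_grad():
                w = sampler(b_seq, epoch / num_epochs)             # sample weights in (0, 1)
            keep = torch.topk(w, int(keep_ratio * len(train_idx))).indices
            mask = torch.zeros_like(w)
            mask[keep] = 1.0                                       # prune low-weight nodes
            loss = weighted_loss(Z, labels, train_idx, w * mask)   # weighted_loss from the sketch above
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gnn
```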


4.4 Time complexity

In this part, we analyze the time complexity of Soft-GNN. The additional training time comes from collecting the training statistics and computing the sample network. For simplicity, assume the GNN and the sample network use the same number of hidden units, denoted by $d$. Collecting the training statistics as input to the sample network costs $O((k+q)|V_{tr}|)$, which is linear in the number of labeled nodes since $k$ and $q$ are usually small. For the sample network, the complexity of the LSTM layer and the fully connected layer is $O(6T|V_{tr}|d)$, where $T$ is the length of the input sequence. Thus, the total extra time complexity is $O(|V_{tr}|d)$. For a 2-layer GCN, the complexity is $O(2|E|d)$, linear in the number of edges $|E|$. The complexity of Soft-GNN is therefore $O((|V_{tr}|+2|E|)d)$. Because $|V_{tr}| \ll |V|$ and $|E| \leq |V|^2$, the final time complexity is $O(|E|d)$, the same as that of GCN. In summary, the additional time cost is tolerable compared to the time consumption of the GNN itself. We also present the comparison of average training time in Section 5.4.

5 Evaluations

For evaluations, we compare Soft-GNN with state-of-the-art baselines on five datasets. We also explore the effectiveness of the proposed method under different noise rates.

5.1 Experimental setups

5.1.1 Datasets

We use five benchmark datasets for evaluations: Cora, Citeseer [45], Pubmed [46], Wiki-cs [47], and Amazon computers (A-computers) [48]. The details are presented in Tab.1.
Tab.1 The statistics of five datasets
Dataset Nodes Edges Class train / val / test
Cora 2,810 15,926 7 140 / 500 / 1310
Citeseer 2,110 7,336 6 120 / 500 / 610
Pubmed 19,717 88,648 3 60 / 500 / 1000
Wiki-cs 11,311 431,108 10 500 / 500 / 1000
A-Computers 13,381 491,722 10 500 / 500 / 1000

5.1.2 Noise types

Since there is no public noisy dataset for node classification, we generate noisy datasets with a given noise rate. Generally, there are two types of noise: uniform noise and pair noise. With uniform noise, a label is corrupted to any other label with uniform probability; with pair noise, a label flips to one pre-defined similar class with a certain probability. For example, if there are three classes and the probability of label corruption is 0.2, the class transition matrices $T_u$ and $T_p$ are as follows:
$$ T_u=\begin{bmatrix}0.8&0.1&0.1\\0.1&0.8&0.1\\0.1&0.1&0.8\end{bmatrix};\quad T_p=\begin{bmatrix}0.8&0&0.2\\0.2&0.8&0\\0&0.2&0.8\end{bmatrix}. $$
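As an illustration, label noise of either type can be injected with a transition matrix as follows (a sketch; the matrices correspond to the three-class example above, and `train_labels` is an assumed NumPy array of integer class labels):

```python
import numpy as np

def corrupt_labels(labels, T, seed=0):
    """Flip each label according to a class transition matrix T, where T[c, c']
    is the probability that true class c is observed as class c'."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for i, c in enumerate(labels):
        noisy[i] = rng.choice(len(T), p=T[c])
    return noisy

# Transition matrices for 3 classes and noise rate 0.2 (matching the example above)
T_uniform = np.full((3, 3), 0.1)
np.fill_diagonal(T_uniform, 0.8)
T_pair = np.array([[0.8, 0.0, 0.2],
                   [0.2, 0.8, 0.0],
                   [0.0, 0.2, 0.8]])

noisy_train_labels = corrupt_labels(train_labels, T_uniform)  # train_labels: assumed np.ndarray of ints
```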

5.1.3 Baselines

For comparison, we select several baselines, which fall into two categories: those designed for image data and those designed for graph data. Four baselines are widely used in image classification to deal with training under noisy labels; in our experiments, we replace their prediction model with the GNN model.
● Decoupling [33]: The framework contains two independent prediction models, which are updated when their predictions disagree. We randomly choose one model to make predictions on the test set.
● Co-teaching [34]: Two prediction models are adopted and one model is updated by the samples screened by the other model. Similarly, we use one model to generate predictions in the test phase.
● Trunc Lq [49]: The proposed loss function Lq is a generalized cross entropy. We use the truncated version from the paper, in which samples whose loss is smaller than the pre-defined threshold are pruned.
● SACE [10]: It utilizes the middle output of the model to dynamically correct the labels and re-weight samples during the training process.
The following four state-of-the-art baselines are designed for training GNNs with noisy labels.
● UnionNET [5]: The main idea is to re-weight samples and correct labels by the label aggregation of the support nodes set.
● SuLD-GCN [6]: The framework focuses on how to correct labels. It adds the superclass node to the original graph and trains GNNs to correct the label with high prediction probability.
● NRGNN [8]: It adds edges between labeled and unlabeled nodes via an edge predictor and trains the GNN model jointly with the edge predictor.
● CLNode [37]: It introduces curriculum learning by sequentially training models on increasingly challenging subsets of the dataset.
In the experiments, we choose two fundamental and representative graph neural network models, GCN [9] and APPNP [36], as the basic models for investigating the performance of the proposed framework. According to [50], GCN and APPNP represent two types of GNN models with different propagation mechanisms: the former conducts linear transformation and non-linear activation repeatedly across multiple layers, while the latter uses a propagation mechanism derived from personalized PageRank and separates feature transformation from the aggregation process.

5.1.4 Experimental settings

For data splitting, we randomly choose 1500 nodes as the visible set, from which the training and validation sets are selected. For the test set, if the number of remaining nodes is less than the size of the visible set, we use all the remaining nodes; otherwise, we randomly select 1000 nodes as the test set. The number of layers is set to 2 and the number of hidden units is 64. The learning rate ranges from 0.1 to 0.01 and the dropout rate is chosen from 0.2 to 0.6. In addition, the ranges of $k$ and $q$ are [1,5] and [10,50]. The epoch at which the sample network is introduced is set between 50 and 100, and the start epoch for other baselines, if needed, is set the same. For baselines, the parameters reported in the original papers are adopted when available. For reliable results, we use five different data splits and run each experiment 10 times per split. All experiments are run on one Tesla P100 GPU.

5.2 Performance comparison

We choose two popular GNNs as the basic models and compare Soft-GNN with the baselines under two types of noise. The comparison results on the node classification task under noise rate = 0.2 are shown in Tab.2. The table shows that Soft-GNN performs better than the basic GNNs and achieves the highest prediction accuracy in most cases. The accuracy gain is around 2%, and exceeds 3% in some cases.
Tab.2 The Micro-F1 (%) under uniform and pair noise rate=0.2. We present the results and 95% confidence interval
Dataset Noise type Backbone Method
Basic Co-teaching Decoupling Trunc Lq SACE UnionNET SuLD-GNN NRGNN CLNode ours
Cora Uniform GCN 73.02±0.53 61.87±1.05 72.66±0.44 72.62±0.49 69.98±0.43 71.32±0.72 73.10±0.54 74.43±0.52 75.49±0.68 76.25±0.57
APPNP 77.64±0.77 64.24±1.28 77.69±0.76 74.13±1.04 75.02±0.95 73.33±1.30 78.16±0.77 77.85±0.71 78.70±0.71
Pair GCN 78.43±0.31 71.15±0.51 78.19±0.31 78.48±0.46 78.38±0.36 78.20±0.36 79.16±0.26 78.69±0.59 79.07±0.46 79.49±0.48
APPNP 81.60±1.01 76.29±1.63 81.27±0.84 81.46±0.82 81.91±0.89 81.51±0.65 81.26±0.69 82.68±0.42 83.66±0.37
Citeseer Uniform GCN 65.27±0.72 59.03±0.96 64.97±0.79 64.84±0.80 66.08±0.81 66.07±1.07 65.07±0.60 67.22±0.75 67.91±0.67 68.40±0.70
APPNP 71.05±0.68 69.75±0.73 70.50±0.77 71.00±0.59 71.86±0.60 70.75±0.71 71.00±0.79 71.64±0.72 71.97±0.74
Pair GCN 65.62±0.63 61.31±1.16 65.58±0.56 67.00±0.75 66.25±0.66 65.80±1.15 65.36±0.67 66.56±0.82 67.94±0.52 68.11±0.56
APPNP 69.46±0.87 67.35±0.80 69.12±0.80 69.95±0.88 70.07±0.62 69.23±0.65 67.44±0.77 70.78±0.44 71.98±0.40
Pubmed Uniform GCN 72.36±1.16 71.76±1.06 72.29±1.11 72.30±1.17 72.84±1.17 72.28±1.12 70.48±1.52 68.46±1.54 72.96±0.99 73.08±1.11
APPNP 74.67±2.02 73.88±1.75 74.37±2.06 75.07±2.05 74.89±1.97 74.35±1.95 72.42±2.73 74.90±1.02 75.65±2.31
Pair GCN 70.21±0.84 66.68±1.85 70.09±0.86 70.81±1.01 71.42±1.07 69.85±0.87 69.22±0.89 68.75±0.77 70.23±1.08 71.02±0.87
APPNP 74.92±1.07 74.39±1.36 74.16±1.25 75.49±1.14 75.40±1.09 74.63±1.08 74.42±1.04 75.92±0.85 76.13±1.02
Wiki-cs Uniform GCN 56.99±0.53 58.24±0.91 56.96±0.51 49.84±1.04 51.66±0.98 48.94±1.73 53.87±1.10 55.23±1.13 56.75±0.67 58.04±0.63
APPNP 67.55±0.73 67.71±0.66 66.33±0.77 62.61±1.23 64.45±0.86 60.42±2.19 56.03±2.29 67.29±0.65 68.81±0.91
Pair GCN 55.36±0.47 56.39±0.45 55.63±0.49 51.34±0.83 52.42±0.79 49.68±1.37 53.53±1.21 54.37±0.74 55.06±0.49 56.87±1.04
APPNP 65.72±0.41 66.54±0.51 64.56±0.49 61.91±0.72 62.94±0.62 58.74±1.94 54.79±1.69 65.11±0.50 66.93±0.70
A-computers Uniform GCN 58.01±1.18 58.27±3.00 57.27±3.34 52.76±1.82 53.36±2.25 48.63±2.18 59.01±2.74 56.95±1.58 59.89±1.32
APPNP 64.20±1.73 64.83±3.22 65.19±3.71 56.03±5.00 56.88±2.84 52.74±3.89 65.32±3.90 64.98±2.23 65.44±1.85
Pair GCN 69.25±1.54 68.52±1.79 68.92±1.61 59.91±2.83 60.44±2.32 54.01±6.77 68.88±1.83 68.34±1.59 70.43±2.21
APPNP 73.98±0.75 71.36±1.79 73.30±0.88 62.28±2.32 63.71±1.88 63.20±3.36 74.51±0.74 73.80±0.99 75.11±0.73
For the methods that train two GNN models, the performance of Co-teaching varies across datasets compared with Decoupling: Co-teaching yields stable improvements in prediction accuracy on Wiki-cs and A-computers, while it suffers severe performance degradation on the other three datasets. By contrast, some baselines are unsatisfactory on the relatively larger datasets Wiki-cs and A-computers. For example, Trunc Lq and SACE perform worse on Wiki-cs and A-computers than on the remaining three datasets, and the same holds for UnionNET, SuLD-GNN, and CLNode. In contrast, our method continues to demonstrate considerable performance improvement on complex network structures. Wiki-cs and A-computers have more nodes, edges, and classes, and Soft-GNN achieves the best results on them compared to other methods, with a performance improvement of approximately 2%. SACE corrects labels with soft labels during training, and UnionNET and SuLD-GNN include a label correction module; with more classes and training samples, label correction is more likely to introduce additional noise, and the performance decreases.
As for NRGNN, we find that it performs better on the relatively small datasets Cora and Citeseer, and is not as good as the basic GNNs on Pubmed and Wiki-cs. During training, NRGNN changes the graph topology by adding edges between labeled and unlabeled nodes that are likely to come from the same class, and it becomes harder to make such positive topology changes as the number of nodes or classes increases. Moreover, NRGNN is more susceptible to out-of-memory errors when training on datasets with complicated graph structures, such as Amazon computers.
Comparing GCN and APPNP, we find that APPNP is more tolerant to label noise: its prediction accuracy is around 3% to 10% higher than that of GCN. Meanwhile, the improvements of Soft-GNN and the baselines on GCN are more obvious than those on APPNP. Since APPNP decouples propagation from prediction, it also blocks the propagation of label noise and reduces its impact during training. Therefore, APPNP performs much better than GCN.
In summary, the results verify the effectiveness of Soft-GNN by finding optimal strategies for data utilization during training.

5.3 Performance analysis

5.3.1 The level of noise rate

In this part, we examine the performance of GCN under different levels of noise rate. The results on Cora and Citeseer are presented in Fig.4. Overall, Soft-GNN performs better than the basic GNNs, and the improvements become clearer at higher noise rates. Compared with the other baselines, the proposed method is the best and even improves the performance of GNNs without label noise, since Soft-GNN realizes flexible data utilization through the sample network and avoids introducing extra noise.
Fig.4 The results under different noise types and rates on Cora and Citeseer. (a) Uniform; (b) Pair; (c) Uniform; (d) Pair


In general, the methods designed for GNNs perform better than methods that ignore the dependencies between nodes. Among the baselines from the image field, Co-teaching is the worst and performs below the basic GNNs in most cases, while SACE performs similarly to GCN and does better under heavier label noise on Citeseer. Among the frameworks designed for GNNs, SuLD-GNN is only minimally effective, and NRGNN performs better than SuLD-GNN in some cases. SuLD-GNN relies on label correction, whose effect depends on the exactness of the modified labels, so it does not always work. Compared with SuLD-GNN, SACE modifies labels with soft labels and reduces the risk of mistakes in some cases; for example, SACE performs much better than SuLD-GNN with uniform noise on Citeseer in Fig.4(c). Similarly, NRGNN adjusts the topology during training. As seen in Fig.4(a), NRGNN helps under severe label noise, while it is counterproductive when the noise rate is small or equal to 0; a similar trend can be seen in Fig.4(c). Therefore, topology adjustment works only up to a point and may introduce extra noise.
In addition, we find that the two types of noise differ in their impact on model performance, even though the overall trend is similar. GCN is more sensitive to uniform noise than to pair noise, while it is relatively harder to improve the performance of GNNs under pair noise.

5.3.2 The performance of mislabeled node

In this part, we explore how the proposed method works by examining the performance on mislabeled nodes and their neighborhoods. We test Soft-GNN based on GCN under noise rate = 0.4. Fig.5 illustrates the performance gains of Soft-GNN over GCN in terms of Micro-F1 and Macro-F1, evaluated with the true labels. The neighborhood of a mislabeled node consists of its top-5 structural neighboring nodes. The figure shows that Soft-GNN improves the prediction accuracy of GCN, and the performance on mislabeled nodes and their neighborhoods also increases. This indicates that the model trained by Soft-GNN is more robust to label noise and, at the same time, mitigates the propagation of noise to the corresponding neighborhoods in GNNs. We also notice that the improvements on mislabeled nodes are smaller than those on their neighborhoods, which is more obvious under pair noise. A closer look at Fig.5(a) and Fig.5(b) shows that the increase in Macro-F1 is much higher than that in Micro-F1 for mislabeled nodes and their neighborhoods. This confirms that the proposed method protects mislabeled nodes and their neighboring nodes from the confusion caused by label noise and learns a more robust GNN model.
Fig.5 The performance gains over GCN on Citeseer under noise rate=0.4. (a) Micro-F1; (b) Macro-F1


5.4 Time and memory comparison

In this section, we compare the training time and memory efficiency of the baselines and our method. We measure the average running time for 200 epochs and the corresponding GPU memory usage. Tab.3 presents the results as ratios to the basic GCN with 64 hidden units. According to the ratios, the training time of Soft-GNN increases, while the memory usage is almost the same as GCN. For Soft-GNN, the time increase comes from the weight generation step and is positively related to the number of training samples; for example, the training time on Wiki-cs is longer than that on Citeseer and Pubmed.
Tab.3 The comparison of time and memory efficiency over GCN with 64 hidden units
Ratio to GCN Time GPU Memory
Citeseer Pubmed Wiki-cs Citeseer Pubmed Wiki-cs
Co-teaching 1.38 1.35 1.31 1.03 1.05 1.02
Trunc Lq 0.50 0.44 0.52
UnionNET 0.72 0.71 0.73
NRGNN 16.18 70.79 61.09 1.93 1.91 9.35
Soft-GNN 2.00 1.85 3.26 1.04 1.02

“–” means the ratio is very close to 1.

The other baselines, except for NRGNN, take roughly the same amount of time and memory during training. We notice that Trunc Lq and UnionNET spend less time than GCN, because partial samples are truncated, which reduces the time of backward propagation. NRGNN, in contrast, needs much more time for training, e.g., 16 times on Citeseer, and its additional time consumption grows with the number of nodes and edges; training NRGNN therefore takes a significant amount of time on Pubmed and Wiki-cs, which have more nodes and edges. As for the memory usage of NRGNN, the extra usage mainly comes from the topology adjustment and depends on the number of edges. Adjusting the graph structure consumes substantial time and memory. In summary, Soft-GNN achieves a steady performance improvement with only limited extra time and memory consumption.

5.5 Hyper-parameter analysis

We explore the effect of two hyper-parameters: the number of structural neighboring nodes $k$ and the length of the node similarity sequence $q$. The range of $k$ is [1,5], and $q$ is chosen from [5,10] with a step of 1 and from [10,50] with a step of 10. As shown in Fig.6, we test the prediction accuracy of Soft-GNN on Cora with uniform noise at a noise rate of 0.3. As $k$ increases, the performance drops in general: more nodes are added to the structural neighborhood, and the newly selected nodes are more likely to come from other classes. We measure the ratio of node pairs from the same class between a node and its structural neighborhood, denoted pair homophily; a larger $k$ usually leads to lower pair homophily, so $k$ should be set to a relatively small value. It is also found that neither too small nor too large a value of $q$ is a good choice; when $k$ is small, it is better to set $q$ below 20. An appropriate combination is $k$ from {1,2,3} and $q$ between 5 and 20. For the evaluations, we set $k=2$ and $q=10$, except $q=20$ on the Wiki-cs dataset.
Fig.6 The performance comparison with different hyper-parameters: q and k on Cora under noise rate=0.3 with GCN


6 Conclusion

In this paper, we analyze the impact of mislabeled nodes when training GNNs under label noise. We find that label noise causes prediction deviation and disrupts the original relationships with neighboring nodes. Based on these observations, we propose a simple yet effective framework for training robust GNNs, denoted Soft-GNN. The introduced sample network utilizes the observed training deviations and outputs sample weights to realize self-adaptive data utilization. The experimental results show that Soft-GNN learns robust GNN models under different noise types and rates; moreover, it remains better even when the data is noise-free. In addition, Soft-GNN improves the performance on mislabeled nodes and their neighboring nodes. We focus on label noise as node-level data corruption; further study will pay close attention to other types of data noise.

Yao Wu is an assistant professor at the National University of Defense Technology, China. She obtained her PhD degree from Huazhong University of Science and Technology, China in 2023, following the completion of her BS degree from the same institution in 2016. Her primary research focus is on graph data mining and analysis

Hong Huang is an associate professor at Huazhong University of Science and Technology, China. She received her PhD degree from the University of Göttingen, Germany in 2016 and her ME degree from Tsinghua University, China in 2012. Her research interests lie in social network analysis, data mining, and knowledge graph

Yu Song received his BS degree in electronic information engineering and his ME degree in computer science, in 2018 and 2021, from Huazhong University of Science and Technology, China. Currently, he is pursuing his PhD degree at the Université de Montréal, Canada. His research interests include graph data mining and NLP

Hai Jin is a Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. Jin received his PhD degree in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. Jin worked at The University of Hong Kong, China between 1998 and 2000, and as a visiting scholar at the University of Southern California, USA between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is a Fellow of IEEE, Fellow of CCF, and a life member of the ACM. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security

References

[1]
Wu B, Li J, Hou C, Fu G, Bian Y, Chen L, Huang J. Recent advances in reliable deep graph learning: adversarial attack, inherent noise, and distribution shift. 2022, arXiv preprint arXiv: 2202.07114
[2]
Li S Y, Huang S J, Chen S . Crowdsourcing aggregation with deep Bayesian learning. Science China Information Sciences, 2021, 64( 3): 130104
[3]
Hoang N T, Choong J J, Murata T. Learning graph neural networks with noisy labels. In: Proceedings of the ICLR LLD 2019. 2019
[4]
Zügner D, Akbarnejad A, Günnemann S. Adversarial attacks on neural networks for graph data. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 2847−2856
[5]
Li Y, Yin J, Chen L. Unified robust training for graph neural networks against label noise. In: Proceedings of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2021, 528−540
[6]
Zhuo Y, Zhou X, Wu J. Training graph convolutional neural network against label noise. In: Proceedings of the 28th International Conference on Neural Information Processing. 2021, 677−689
[7]
Du X, Bian T, Rong Y, Han B, Liu T, Xu T, Huang W, Huang J. PI-GNN: a novel perspective on semi-supervised node classification against noisy labels. 2021, arXiv preprint arXiv: 2106.07451
[8]
Dai E, Aggarwal C, Wang S. NRGNN: learning a label noise resistant graph neural network on sparsely and noisily labeled graphs. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 227−236
[9]
Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of ICLR 2017. 2017
[10]
Huang L, Zhang C, Zhang H. Self-adaptive training: beyond empirical risk minimization. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1624
[11]
Frenay B, Verleysen M . Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25( 5): 845–869
[12]
Bootkrajang J, Kabán A . Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics, 2013, 29( 7): 870–877
[13]
Shi X, Che W . Combating with extremely noisy samples in weakly supervised slot filling for automatic diagnosis. Frontiers of Computer Science, 2023, 17( 5): 175333
[14]
Xiao T, Xia T, Yang Y, Huang C, Wang X. Learning from massive noisy labeled data for image classification. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 2691−2699
[15]
Tian Q, Sun H, Peng S, Ma T . Self-adaptive label filtering learning for unsupervised domain adaptation. Frontiers of Computer Science, 2023, 17( 1): 171308
[16]
Beck C, Booth H, El-Assady M, Butt M. Representation problems in linguistic annotations: ambiguity, variation, uncertainty, error and bias. In: Proceedings of the 14th Linguistic Annotation Workshop. 2020, 60−73
[17]
Biggio B, Nelson B, Laskov P. Support vector machines under adversarial label noise. In: Proceedings of the 3rd Asian Conference on Machine Learning. 2011, 97−112
[18]
Abellán J, Masegosa A R. An experimental study about simple decision trees for bagging ensemble on datasets with classification noise. In: Proceedings of the 10th European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty. 2009, 446−456
[19]
Wang R, Liu T, Tao D . Multiclass learning with partially corrupted labels. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29( 6): 2568–2580
[20]
Sigurdsson S, Larsen J, Hansen L K, Philipsen P A, Wulf H C. Outlier estimation and detection application to skin lesion classification. In: Proceedings of 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. 2002, 1049−1052
[21]
Bouveyron C, Girard S . Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognition, 2009, 42( 11): 2649–2658
[22]
Brodley C E, Friedl M A . Identifying mislabeled training data. Journal of Artificial Intelligence Research, 1999, 11: 131–167
[23]
Thongkam J, Xu G, Zhang Y, Huang F. Support vector machine for outlier detection in breast cancer survivability prediction. In: Proceedings of the Asia-Pacific Web Conference. 2008, 99−109
[24]
Miranda A L B, Garcia L P F, Carvalho A C P L F, Lorena A C. Use of classification algorithms in noise detection and elimination. In: Proceedings of the 4th International Conference on Hybrid Artificial Intelligence Systems. 2009, 417−424
[25]
Dedeoglu E, Kesgin H T, Amasyali M F . A robust optimization method for label noisy datasets based on adaptive threshold: adaptive-k. Frontiers of Computer Science, 2024, 18( 4): 184315
[26]
Wang Y, Kucukelbir A, Blei D M. Robust probabilistic modeling with Bayesian data reweighting. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 3646−3655
[27]
Jiang L, Zhou Z, Leung T, Li L J, Li F F. MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 2304−2313
[28]
Shu J, Xie Q, Yi L, Zhao Q, Zhou S, Xu Z, Meng D. Meta-weight-net: learning an explicit mapping for sample weighting. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 172
[29]
Nguyen D, Mummadi C K, Ngo T P N, Nguyen T H P, Beggel L, Brox T. SELF: learning to filter noisy labels with self-ensembling. In: Proceedings of the 8th International Conference on Learning Representations. 2020
[30]
Chen P, Liao B, Chen G, Zhang S. Understanding and utilizing deep neural networks trained with noisy labels. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 1062−1070
[31]
Zheng G, Awadallah A H, Dumais S. Meta label correction for noisy label learning. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 11053−11061
[32]
Liu J, Li R, Sun C . Co-correcting: noise-tolerant medical image classification via mutual label correction. IEEE Transactions on Medical Imaging, 2021, 40( 12): 3580–3592
[33]
Malach E, Shalev-Shwartz S. Decoupling "when to update" from "how to update". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 961−971
[34]
Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I W, Sugiyama M. Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8536−8546
[35]
Guo X, Wang W . Towards making co-training suffer less from insufficient views. Frontiers of Computer Science, 2019, 13( 1): 99–105
[36]
Gasteiger J, Bojchevski A, Günnemann S. Predict then propagate: Graph neural networks meet personalized pagerank. In: Proceedings of the International Conference on Learning Representations (ICLR). 2019
[37]
Wei X, Gong X, Zhan Y, Du B, Luo Y, Hu W. CLNode: curriculum learning for node classification. In: Proceedings of the 16th ACM International Conference on Web Search and Data Mining. 2023, 670−678
[38]
Liu Y, Wu Z, Lu Z, Wen G, Ma J, Lu G, Zhu X. Multi-teacher self-training for semi-supervised node classification with noisy labels. In: Proceedings of the 31st ACM International Conference on Multimedia. 2023, 2946−2954
[39]
Kuang K, Cui P, Athey S, Xiong R, Li B. Stable prediction across unknown environments. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 1617−1626
[40]
Hwang M, Jeong Y, Sung W. Data distribution search to select core-set for machine learning. In: Proceedings of the 9th International Conference on Smart Media and Applications. 2020, 172−176
[41]
Paul M, Ganguli S, Dziugaite G K. Deep learning on a data diet: finding important examples early in training. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2021, 20596−20607
[42]
Wang H, Leskovec J. Unifying graph convolutional neural networks and label propagation. 2020, arXiv preprint arXiv: 2002.06755
[43]
Dong W, Wu J, Luo Y, Ge Z, Wang P. Node representation learning in graph via node-to-neighbourhood mutual information maximization. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 16620−16629
[44]
Hochreiter S, Schmidhuber J . Long short-term memory. Neural Computation, 1997, 9( 8): 1735–1780
[45]
Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T . Collective classification in network data. AI Magazine, 2008, 29( 3): 93–106
[46]
Namata G, London B, Getoor L, Huang B. Query-driven active surveying for collective classification. In: Proceedings of the Workshop on Mining and Learning with Graphs. 2012
[47]
Mernyei P, Cangea C. Wiki-CS: a wikipedia-based benchmark for graph neural networks. In: Proceedings of the Graph Representation Learning and Beyond Workshop on ICML. 2020
[48]
Shchur O, Mumme M, Bojchevski A, Günnemann S. Pitfalls of graph neural network evaluation. 2019, arXiv preprint arXiv: 1811.05868
[49]
Zhang Z, Sabuncu M R. Generalized cross entropy loss for training deep neural networks with noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8792−8802
[50]
Zhu M, Wang X, Shi C, Ji H, Cui P. Interpreting and unifying graph neural networks with an optimization framework. In: Proceedings of the Web Conference 2021. 2021, 1215−1226

Acknowledgment

The work was supported by the National Natural Science Foundation of China (Grant No. 62127808).

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press