A comprehensive survey on graph neural network accelerators

Jingyu LIU, Shi CHEN, Li SHEN

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (2) : 192104.

PDF(16089 KB)
Front. Comput. Sci. All Journals
PDF(16089 KB)
Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (2) : 192104. DOI: 10.1007/s11704-023-3307-2
Architecture
REVIEW ARTICLE

A comprehensive survey on graph neural network accelerators

Author information +
History +

Abstract

Deep learning has gained superior accuracy on Euclidean structure data in neural networks. As a result, non-Euclidean structure data, such as graph data, has more sophisticated structural information, which can be applied in neural networks as well to address more complex and practical problems. However, actual graph data obeys a power-law distribution, so the adjacent matrix of a graph is random and sparse. Graph processing accelerator (GPA) is designed to handle the problems above. However, graph computing only processes 1-dimensional data. In graph neural networks (GNNs), graph data is multi-dimensional. Consequently, GNNs include the execution processes of both traditional graph processing and neural network, which have irregular memory access and regular computation, respectively. To obtain more information in graph data and require better model generalization ability, the layers of GNN are deeper, so the overhead of memory access and computation is considerable. At present, GNN accelerators are designed to deal with this issue. In this paper, we conduct a systematic survey regarding the design and implementation of GNN accelerators. Specifically, we review the challenges faced by GNN accelerators, and existing related works in detail to process them. Finally, we evaluate previous works and propose future directions in this booming field.

Graphical abstract

Keywords

graph neural network / accelerators / graph convolutional networks / design space exploration / deep learning / domain-specific architecture

Cite this article

Download citation ▾
Jingyu LIU, Shi CHEN, Li SHEN. A comprehensive survey on graph neural network accelerators. Front. Comput. Sci., 2025, 19(2): 192104 https://doi.org/10.1007/s11704-023-3307-2

1 Introduction

In recent years, deep learning models such as convolutional neural networks (CNNs), transformers, and recursive neural networks (RNNs) have performed well in dealing with Euclidean structure data, obtaining outstanding predicted results. In practical situations, while deep learning has gained great success on Euclidean data, there is an increasing number of applications where data are generated from non-Euclidean domains and need to be analyzed. Non-Euclidean structure data, such as graph data, is a typical representation. Graph data contains abundant structural information and can acquire the properties of adjacent nodes to update its nodes. And its structure content caters to real world’s information propagation. As a result, to get more accurate prediction in graph data. Graph neural network (GNN) [1-25] comes into existence. GNN is a deep learning model to handle graph data, which makes machine learning applied into graph data structure. Then various GNN models are proposed, graph convolution networks (GCNs) [7-19,26-35], graph recurrent network (GRN) [4], graph attention networks (GATs) [2], graph auto-encoders (GAEs) [1], graph generative networks (GGNs) [3], etc. These GNN models play an important role in these applications, i.e., node classification, graph clustering, link prediction, node clustering, protein network and so on.
The execution of GNN models integrates the features of graph computation [36-46] and neural network, which aggregate the neighbor nodes’ information through the product of adjacent matrix and vertex property, and update the vertex’s property through multilayer perceptron (MLP). As a result, the GNN model exhibits hybrid execution patterns in computation and memory access. The aggregation phase features irregular memory access and computation, while the combination phase is vice versa. Moreover, the dataset of the GNN is composed of real-world graph data, which contains a significant number of nodes and edges. Therefore, as the depth of the model layers increases in the GNN, the amount of computation and irregular memory access cause more significant performance degradation and energy consumption, accounting for execution time and energy expenditure. Nevertheless, the accelerator designed for graph computation and general neural networks cannot be applied to handle this simultaneously. To address the aforementioned problems, there are numerous emerging studies focusing on GNN accelerators, as GCN is the most typical representation of GNNs and all variants of GNN are derived from GCN. The survey on GNN accelerators mainly focuses on those of GCN.
Because the special execution pattern on GNN, conventional architectures, i.e., CPU and GPU [47,48] is not fit for the work, occurring the problem of imbalance of workload, excessive memory access time, low energy efficiency,etc. The reason is explicit, real-world graphs often follow a power-law distribution in a sense that most of vertices are associated with a few edges. The irregularity in aggregation phase inherently falls short in exploiting memory- and instruction-level parallelism on traditional processors. Besides, the dimension of GNN’s node feature vectors can be very high, e.g., each node in Citeseer graph has 3703 features, which can lead to paramount processing costs in the combination phase. Consequently, lots of accelerating GNN works are in demand. The first work of accelerator on GNN application is HyGCN [49], which designs a two-phase architecture to accomplish two computation patterns respectively.
From the first work yielding, plenty of works addressing accelerating GNN are emerging. At present, the trend and opportunities of Domain-Specific Architecture (DSA) and emerging architecture, e.g., process in memory (PIM) is developing. Hardware platform templates, e.g., Field Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC), are in line with the demand of times. With regard to the acceleration strategy in the past few years, at the beginning, systolic arrays and ReRAMs are used frequently in general and PIM architecture respectively. Then, pipeline of stages and buffers on chips are also included. Besides, the co-design of algorithm and architecture, i.e., software and hardware, is the main stream of design plan. What’s more, in-storage architecture is also applied. Mostly, the process of adjacent matrix is often considered because it is the cause of abundant memory access and energy consumption. In a sense, most of work are on DRAM, even trying on cache with some methods. But graph data and GNN scale is still increasing, which take scalability of GNN into consideration.
To address above significant aspects and gaps, this paper provides a comprehensive and systematic overview and survey of GNN accelerators:
● We provide a unified framework to categorize the studies on GNN accelerators, which can sort out the architectures and methods.
● We give an introduction of GNN accelerators’ latest work to have a master of research directions.
● We provide a comprehensive overview of unique characteristics of GNN accelerators as well as challenges of it incurred by them.
● Lastly, open issues and prospects for GNN accelerators research are discussed.
The rest of this paper is organized as follows. Section 2 includes a detailed introduction to basic components of GNN and its accelerator, and briefly summarizes recent progress. At the same time, we give a classification and general framework of GNN accelerators. Challenges and characteristics are given in Section 3. So the solutions to the challenges are introduced in Section 4. What’s more, we make a review of previous work and give a direction on the research in Section 5. Finally, this paper concludes in Section 6.

2 Preliminaries

In this section, we give a brief introduction to the preliminaries of graph, GNN and its acceleration architecture, including graph representation and several common GNN models and architecture classification. Then, we summarize some unique characteristics of graph and GNN, followed by the related work of them on general-purpose processors. The characteristics of GNN and the related work further motivate our survey work on GNN accelerators. At last, we make a category framework of GNN accelerators to analyze them further.

2.1 Graph

Graph is a type of data structure consisting of vertices and edges further associated with them. A graph, expressed by G, can be generally defined as G=(V,E) where V represents the vertex set and E indicates the edge set. For a directed graph, an edge can be represented as e = (vi, vj), indicating that there is an edge pointing from vi to vj. To indicate E in an intuitive way, we use adjacent matrix, expressed by A, representing the neighbor relationships between vertices. We provide definitions of basic graph concepts. For easy retrieval, we summarize the commonly used notations in Appendix A.
In particular, a vertex and an edge can also be attributed with single or multiple properties. Real-world natural graphs, e.g., social networks, usually have the following three general features.
● Power-law distribution. Only a few vertices have associated most of the edges, leading to a severe workload imbalance issue with a large number of conflicts when high-degree vertices are being aggregated.
● Sparsity. The average number of vertex degrees is relatively small. The sparsity of graphs can result in poor locality for data accesses, which causes irregular memory access and computation.
● Small-world structure. Any two vertices in the graph can be connected with only a small number of hops. This feature will make it difficult for partitioning the graph efficiently.

2.2 Graph neural network

Different from CNN, GNN is a type of neural network, which learns from graph data structure and gets accurate predictions in non-Euclidean structure data. This is a general term for algorithms that use neural networks to learn graph structure data, extract and explore features and patterns in graph structure data, and meet the requirements of graph learning tasks such as clustering, classification, prediction, segmentation, and generation.
The previous GNN used RNN to deal with undirected graphs, directed graphs, label graphs and cyclic graphs. After that, it is believed that the first type of GNN is not able to deal with the complex and changeable graph data in reality. As a result, CNN model’s architecture is applied to the graph. Through the ingenious transformation of the convolution operator, it proposes a graph convolutional network (GCN), and derives many variants. GCN realizes the translation invariance, local perception and weight sharing of CNN on the graph, providing ideological guidance and reference for the construction and improvement of other GNN frameworks. Besides, all GNN models are based on GCN variants, so most of the accelerators work on GNNs is also based on GCN design, and the subsequent introduction and analysis in this survey are based on GCN.
GCN has two main methods for convolution operation: one is based on spectral decomposition, that is, spectral decomposition graph convolution, the other is based on node space transformation, that is, spatial graph convolution. Because the spectral GNN transforms all nodes in the graph at the same time, which requires high computing and storage space and is difficult to adapt to the analysis of large-scale graph data sets, the subsequent introduction and analysis in this survey are based on spatial GCN, which is main optimization model in accelerators work. The variant model is in Appendix B.
The existing mainstream graph neural network algorithm can be abstracted as a model:
hv=Aggregate(huN(v)(k1)),
hv(k)=Combine(hv(k1),hv),
where hv(k) denotes the feature vector obtained by updating node v in the kth layer. Overall, graph neural network is a hybrid of traditional graph computation and neural network. In the execution of each layer, the graph neural network first traverses the entire graph (or sampled subgraphs), and then aggregates the neighborhood information for each node. The node’s intermediate features hv in this layer are obtained by aggregating the information of each node. The process is similar to the traditional graph computation, as in Eq. (1). The node’s intermediate features are combined with the node’s features obtained in the previous layer. The node’s output features in this layer are obtained by updating vector hv(k), the process is similar to that of traditional neural networks, as in Eq. (2). And the entire process is in Fig.1.
Fig.1 The entire process of GNN [50]

Full size|PPT slide

2.3 Accelerator architecture

As one of the most popular research directions in artificial intelligence currently, GNN has produced a large number of different algorithms. The main general programming model is showed in Algorithm 1. Many people have carried out research and development work on graph-oriented neural network framework and extension library based on a common platform. Because graph neural network applications and traditional neural network applications have similarities and differences in many execution methods, many existing work is based on the mature neural network framework, For example, PyTorch and Tensorflow are extended to form a new framework supporting the application of graph neural network. These frameworks and extension libraries support a variety of different graph neural network algorithms, most of which have been open source, so that users can flexibly construct GNN on them. The mainstream three graph neural network frameworks and extension libraries will be introduced below.

Full size|PPT slide

PyTorch Geometric (PyG), is the most commonly used general graph neural network extension library at present. It is based on the PyTorch framework, and supports running on CPU and GPU platforms. It has been open source. In addition to common graph structure data processing methods, PyG also provides support for methods such as relational learning and 3D data processing. PyG provides users with a universal message passing interface, enabling users to implement new research ideas on graph neural network algorithms on PyG.
DeepGraph Library (DGL) is an open source graph neural network extension library, which is based on a variety of existing neural network framework extensions, and currently supports Tensorflow, PyTorch and MXNet, And minimize the workload when users migrate the graph neural network model across platforms. DGL abstracts the calculation process of the graph neural network as a user-configurable message delivery unit, and extracts the connection between the sparse matrix multiplication in the graph neural network and the message delivery mechanism, The calculation operation is integrated into generalized sparse dense matrix multiplication (g-SpMM) and generalized sampling dense matrix multiplication (g-SDDMM). In addition, DGL also introduces different types of parallel strategies, by adopting node-level parallelism for gSpMM and edge-level parallelism for g-SDDMM, making it have high execution speed and memory access efficiency.
AliGraph [51] is an open source distributed framework for the large-scale graph neural networks released by Alibaba Group. AliGraph’s system is divided into operators Sampling and data storage are three levels. The operator level decomposes different graph neural network algorithms into three operators in the system: sample, aggregate and combine, and performs optimization calculations; The sampling layer integrates a variety of different sampling strategies to enable it to generate training samples quickly and accurately; The data storage layer implements efficient data organization and storage through flexible graph partitioning, separating storage of different attributes in the graph, and caching frequently used data.
After introducing the software aspects of GNN accelerator, the hardware hierarchy is also considered necessarily. Because of the special phase like GPA, the main hardware platforms of GNN accelerators have three types, which are ASIC, FPGA (CPU-FPGA), and emerging devices such as PIM. General processors, i.e., CPU and GPU, cannot accelerate the GNN model directly because of the two-phase execution pattern, which needs dynamic computation and irregular memory access in aggregate stage and vice versa in combine stage. As for the accelerators hardware architecture, there are centralized and separate hardware design plans. The typical representation for separate designs are HyGCN, GraphACT [52] for accelerating aggregate and combine stage separately, HyGCN is an early proposed two-segment GNN inference accelerator using ASIC, where the two main computational modules perform the aggregation operation and the combination operation in GNN, respectively. The aggregation part consists of an access controller containing sparse elimination and multiple Single Instruction Multiple Data (SIMD) cores. “Multiple single-instruction multiple-data cores can be flexibly configured to aggregate at high utilization. The combination part is a matrix multiplication unit composed of multiple configurable Systolic Array (SA) [53] architectures, which can work independently or in concert. GraphACT proposes a GNN training system implemented by a host CPU+FPGA accelerator. The host part, in addition to executing the softmax function and other results and the calculation of the loss function during training, also preprocesses the GNN task to be computed by an elaborate matching algorithm, arranges the data to be computed in the most conducive way for reuse, and transfers them to the memory on the FPGA board; the FPGA part contains two major hardware function modules, which perform vector and matrix computation, and the corresponding on-chip cache. EnGN [54] and AWB-GCN [26] are the centralized type, which gather the two-stage computation pattern in the same hardware for execution. AWB-GCN focuses mainly on the problem of load imbalance in sparse matrix multiplication computation, three techniques of distribution smoothing, remote swapping, and special row remapping are proposed for load balancing, and corresponding data path support is designed in hardware to improve the utilization of structurally uniform PE units. Since the sparse matrix of the sparse matrix multiplication computation in GNN is determined by the tasks, and the sparsity of each task is different, and it also proposes an auto-tuning module to adjust the deployment strategy of the next cycle by the current cycle execution, so that the hardware utilization can be improved by auto-tuning when the GNN algorithm is run several times on the same data set. EnGN, this work proposes a Neural Graph Processing Unit (NGPU) to perform sparse matrix or dense matrix multiplication computations in GNNs. Specifically, the NGPU consists of a two-dimensional array of interconnected Processing Element (PE), each of which contains a multiplicative accumulation computation unit, a degree aware feature cache, and a weight cache to better exploit the data time reuse property. In addition, each column of PEs is connected by a topology called Ring Edge Reduce (RER), and it is through this data path that EnGN performs sparse matrix operations to aggregate the feature data to be aggregated through several cycles of data transfer, in conjunction with preprocessing of graph data. PIM, as an emerging technology used in GNN to accelerate computation and memory access, e.g., ReRAM, which uses crossbar as resistance, wordline as voltage from input to DAC, bitline as current from the output of voltage and resistance, gathering the multiplication and accumulation to ADC. This detailed introduction will be in Section 4. The present accelerator hardware architecture will be displayed in Tab.1.
Tab.1 Current GNN acceleration architectures
Name Model supported Stage optimized Platform Hybrid or uniform
HyGCN GCNs Inference ASIC Hybrid
Auten et al. [55] GCNs, GATs Inference ASIC Hybrid
GraphACT GCNs Training CPU-FPGA Hybrid
DeepBurning-GL [56] GCNs Inference FPGA Hybrid
AWB-GCN GCNs Inference ASIC Uniform
GCNAX [28] GCNs Inference ASIC Uniform
Cambricon-G [57] GCN,GraphSAGE Both ASIC Uniform
BlockGNN [58] GCNs,GAT Inference FPGA Uniform
ReGraphX [59] GNNs Training PIM Uniform
I-GCN [30] GCNs Inference ASIC Hybrid
GRIP [60] GCN, GraphSAGE, GIN Inference ASIC Hybrid
EnGN [54] GCNs, GRN Inference ASIC Uniform
Huang et al. [27] GCNs Inference PIM Uniform
GCOD [29] GCNs Inference ASIC Uniform
ReGNN [61] GNNs Inference ASIC Hybrid
ReGNN [62] GCNs Inference PIM Hybrid
Graphite [31] GNNs Both CPU Uniform
SmartSAGE [63] GraphSAGE Training CPU-FPGA Uniform
CoGNN [64] GNNs Training GPU Uniform
PASGCN [34] GCNs Inference PIM Uniform
FlowGNN [65] GNNs Inference ASIC Hybrid
SGCN [33] GCNs Inference ASIC Hybrid
GROW [32] GCNs Inference ASIC Uniform
GraNDe [66] GCNs Inference ASIC Hybrid
GNNAdvisor [25] GNNs Both GPU Uniform
GNNLab [20] GNNs Training GPU Uniform
Degree-Quant [22] GNNs Both CPU-GPU Uniform
Flexgraph [21] GNNs Training CPU Uniform
SGQuant [23] GNNs Both Memory-constrained Devices Uniform
QGTC [24] GNNs Both GPU Uniform
Xu et al. [35] GCNs Training GPU Uniform
Table 1 displays present GNN accelerators’ holistic architecture parameters, including the model supported, stage supported, platform and hardware design supporting computation phase, which is two phases separately or simultaneously.

2.4 Categories and frameworks

With regard to GNN accelerators, we conduct a systematic review on them. It aims at exploring the key issues in the design and implementation of GNN accelerators. As summarized in Fig.2, we have identified a complete set of core components for GNN accelerators, which involves three major aspects: preprocessing, computation model, and scheduling strategies.
Fig.2 The design aspects of GNN accelerators

Full size|PPT slide

Preprocessing A GNN accelerator mostly has few storage resources in actual life. Graphs are needed to be partitioned. Preprocessing is an important part that implements on graph data (e.g., adjacent matrix) for trying to make the graph data fit into the memory capacity of the GNN accelerator. It is also the key to match a certain processing model and appropriate hardware resource before the formal processing. What’s more, this can enhance data locality, for better regular memory access and more swift data communications.
Computation model The GNN computation model serves as the main execution part of GNN accelerator design. Parallelism often consists of vertex-intra and vertex-inter parallelism, which belongs to aggregate phase. The order of computation is also considered in it because the dimension of vertex property is different in respective phase. The implementation of this part generally relies on some hardware platform, e.g., FPGA, ASIC, or Processing-In-Memory (PIM). Different specifications have different concerns on hardware designs and sophisticated software co-designs for high throughput and energy efficiency.
Scheduling strategies This part aims at how to schedule a large number of operations and data on a limited set of hardware resources of GNN accelerators. The scheduling component often involves data communication, execution mode, and scheduling scheme to get a higher utilization of resources for solving imbalance workload problems.

3 The characteristics and challenges

3.1 GNN architecture characteristics

From the above sections, we have a basic knowledge of GNN accelerators. In summary, we outline three typical features of GNN architecture characteristics. They are two-phase computations execution mode, complex dataset and poor scalability.
Two-phase computations execution mode As proposed before, GNN model is composed of two phases in every iteration, aggregate and combine. Aggregate phase is like graph computation execution, gathering the nodes’ property according to the linking edges information to get the node’s new features, i.e., message passing mechanism. Combine phase is like neural networks computation, passing weight, bias and activation functions to update node’s new features. As a result, GNN possesses graph computation and neural networks characteristics simultaneously, which cannot be efficiently implemented on conventional computational resources. We can learn from Tab.2, that GNN accelerator should satisfy two opposite computation and memory access execution mode in one architecture system.
Tab.2 Execution behaviors in GNNs [49]
Aggregation Combination
Access pattern Indirect and Irregular Direct and regular
Data reusability Low High
Computation Pattern Dynamic and Irregular Static and regular
Computation Intensity Low High
Execution Bound Memory Compute
Complex dataset To further know the GNN features, we show the dataset of general GNN model in Tab.3. According to the table, we can learn that the dataset of GNN consists of considerable edges and nodes with multiple-dimension features, which is different from GPA’s one-dimension nodes. What’s more, following the power-law distribution, the adjacent matrix of the graph’s sparse degree is up to 99.9%, which makes GNN computation and memory access dynamic and irregular, causing more execution time and energy consumption, as is shown in Fig.3. The weight in GNN is dense and regular, which lets GNN keep a balanced computation in a dilemma.
Tab.3 Current common GNN dataset
Name Nodes Edges Features Storage Classes
Pummbed(PB) 19717 88648 500 38MB 3
Cora(CR) 2708 10556 1433 15MB 7
Citseer(CS) 3327 9104 3703 47MB 6
Reddit(RD) 232965 11465892 602 1.8GB 41
NELL(NE) 65755 266144 5414 1.3GB 210
Fig.3 The non-zero elements of GNN adjacent matrix (a) and weight matrix (b) [26]

Full size|PPT slide

Poor scalability Since GNN has the features that is similar to CNN, why not use multiple layers to get more accurate predictions? As is shown in Fig.4, one layer represents one hop neighbor information gathering, if we iterate considerable layers to get more hops information, the amount of computation and memory access is huge, which is bound to platform and also causes overfitting, severe unbalanced workload, etc.
Fig.4 An example of inference on vertex B using a two-layer GCN. The nodeflow describes the propagation of features within a message-passing layer (MPL) [60]. (a) Input graph; (b) nodeflow; (c) inference dataflow

Full size|PPT slide

3.2 GNN accelerators design challenges

On the basis of GNN various features, there are five main challenges that happen in designing GNN accelerators.
Parallelism In GNN model, the parallelism used is mainly in aggregate phase. As is shown in Fig.5. Aggregate phase accounts for the execution time. Fig.5 illustrates their execution time ratio on different models and datasets. Their execution times differ due to the variable length of feature vectors and the execution flow of GCNs. To accelerate GNN, we can utilize vertex-intra and vertex-inter parallelism to improve the computation performance, which is shown in Fig.1. Vertex-intra parallelism is to gather a destination node’s all neighbors properties to aggregate in parallel. Vertex-inter parallelism is to process every destination node’s aggregate phase simultaneously. Nevertheless, to efficiently make use of these two parallelism is a problem. Especially in PIM architecture [34], the two mode parallelism is bounded by the crossbar utilization. What’s more, leveraging both parallelism is also a challenge on software and hardware level, because these two parallelisms needs to be compromised on the variable length of feature vectors and the execution flow of GCNs.
Fig.5 Execution time breakdown of the two phases [49]

Full size|PPT slide

Data locality As is proposed before. The adjacent matrix A is sparse and random, which has a poor locality, letting aggregate phase occur considerable dynamic computation and irregular memory access. So we hope to find a way to enhance the locality, decreasing the energy consumption caused by memory access. Data locality is mainly about adjacent matrix’s optimizations in existing works, because there are lots of zeros in graph’s adjacent matrix following power-law distributions, which results in plenty of useless computations. Yan et al. [49] adopts dynamic sparsity elimination to reduce redundant accesses, which is interval-shard partition, interval-wise feature access, window sliding and window shrinking. However, the operation complexity is of great height.
Data reuse Data reuse mainly focuses on node features’ reuse, and graphs following power-law distributions have such characteristics that there are always few nodes having most neighbors while most nodes having few neighbors. To this end, if these few nodes and their neighbors’ features can be reused in GNN computation, much redundant storage and computation resources can be saved. What’s more, the aggregate phase will have much redundancy in computation and communications, which lots of nodes features will be reused for aggregate for many times. As a result, if we can improve the reuse of nodes’ property in GNN, the two phases will be both accelerated and the pressure on cache is also reduced. But the dataset of GNN is big and sparse, it’s challenging to deal with it.
Utilizations of resources Apart from sparsity in adjacent matrix, when executing GNN, imbalance of workload is an obvious problems, that is, some vertices have more neighbors while some have only a few, so when executing aggregate phase, the computation resource will be occupied by more neighbors’ nodes. Hence, according power-law distribution, there will be lots of process elements idle in workload while others busy, causing time-consuming results. We need the resources scheduling in computation to address it.
Single stage optimized From Tab.1, we can see that most GNN accelerators only support accelerating training stage or inference stage, only a few accelerators support holistic process. This is because lots of work make optimizations on partial stage due to the different computation patterns in GNN training and inference. Accelerators supporting training and inference stages in GNN need to consider and design more complicated algorithms and architectures. To be exact, GNN accelerator is different from CNN’s, which only think of forward inference and backward propagation in each iteration. GNN accelerator needs to consider aggregate phase in training, which is another big challenge.

4 Detailed works

From the above mentioned challenges, we will introduce some related existing works to handle the problems in detail. There are three aspects as follows.

4.1 Optimizations on memory

We can see the work Graphite [31], it still uses CPU to accelerate GNN, analyzing the execution behavior on CPU. As is shown in Fig.6, the main reason of performance bottleneck is memory-bound, only 10% of the pipeline slots are attributable to useful work. Further, in 62% of the slots, the pipeline is stalled waiting for loads and stores. Since the aggregate performs a simple reduction for each vertex after gathering its neighbors’ feature vectors, the operation is highly memory-intensive. Therefore, Graphite replaces CPU with DMA(direct memory access) on memory-intense phase, reducing main memory bandwidth pressure appears to be key to optimize GNN workloads on CPUs.
Fig.6 Breakdown of the pipeline slots spent on retiring micro-ops or stalled by different bottlenecks during a full-batch training of GraphSAGE on a CPU [31]

Full size|PPT slide

GNNAdvisor [25], an efficient runtime system to systematically accelerate GNN applications on GPUs, as is shown in Fig.7. The work leverages the performance gains from the GNN input-level properties, such as communities in graphs, node degree and dimensionality of node embedding to implement node renumbering, group-based workload partitioning, and dimension-based workload sharing. These aforementioned optimizations are tailored for GNN computation to balance intra-thread efficiency and inter-thread parallelism, reducing the high-overhead memory operations respectively. Besides, a Modeling and Estimating strategy to automate the optimization process with minor manual efforts is accomplished by a set of performance-related parameters for user tuning flexibility.
Fig.7 Overview of GNNAdvisor [25]

Full size|PPT slide

GNNLab [20], a sample-based GNN training system in a single machine multi-GPU setup. A new factored space sharing design for sample-based GNN training that reduces intra-task resource contention and unleashes inter-task data locality, and solutions to tackling the imbalanced load issues introduced by the factored design. A general GPU-based feature caching scheme, as well as a caching policy based on pre-sampling that is robust to diverse sampling algorithms and GNN datasets, which makes an optimization on GNN GPU memory. The insights of these challenges and characteristics is in Fig.8.
Fig.8 A breakdown of memory usage and data similarity for different stages of the SET model when training OGB-Papers over multiple GPUs (G0, G1,...) with 16 GB of memory each [20]

Full size|PPT slide

Another work SmartSAGE [63], training large-scale GNN using in-storage processing architectures (ISP), is to train large-scale dataset in real productivity and life. State-of-the-art (SOTA) ML frameworks employ an in-memory processing model which significantly hampers the productivity of ML practitioners as it mandates the overall working set to fit within DRAM capacity. Hence, they explore the feasibility of utilizing capacity-optimized NVMe SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size.
To reduce data redundancy in aggregate phase, ReGNN [61], proposes a dynamic redundancy-eliminated neighborhood message passing algorithm for GNNs. The work utilizes the redundancy existing in aggregate phase, as is shown in Fig.9. This work designs a dynamic redundancy-eliminated neighborhood message passing algorithm for GNN to find the redundancy sets in advance, then by data reuse in redundancy sets to reduce memory pressure, ReGNN also uses hardware to support the algorithm. The holistic architecture is shown in Fig.10, which can be reconfigured by adjust the corresponding switches.
Fig.9 Illustrative examples of EdgeUpdate redundancy (ER) and Aggregation redundancy (AR) [61]

Full size|PPT slide

Fig.10 Overview of ReGNN [61]

Full size|PPT slide

I-GCN [30] and GCoD [29] are two typical works using data locality to accelerating GNN. I-GCN uses a mechanism called islandization, which is a new online graph restructuring algorithm clustering of nodes with strong internal but weak external connections. The islandization process yields two major benefits. First, by processing islands rather than individual nodes, there is better on-chip data reuse and fewer off-chip memory accesses. Second, there is less redundant computation as aggregation for common/shared neighbors in an island can be reused. This is done without any preprocessing of the graph data or adjustment of the GCN model structure. Experimental results show that I-GCN can significantly reduce off-chip accesses and prune 38% of aggregation operations. The whole process is in Fig.11.
Fig.11 Overview of I-GCN [30]

Full size|PPT slide

GCoD [29] integrates a split and conquer GCN training strategy, which polarizes the graphs to be either denser or sparser in local neighborhoods without affecting the model accuracy, letting graph adjacency matrices have merely two types of workload and benefits from largely improved regularity and thus ease of acceleration. The overview is in Fig.12. GROW [32] treats GCN as a sparse-dense matrix multiplication, so it utilizes an uniform architecture to inference GCN. The main purpose of this work is to employ a row-stationary dataflow based on the row-wise product matrix multiplication (i.e., Gustavson’s algorithm [67]), allowing both flexible and fine-grained adaptation to the two heterogeneous sparsity patterns, which can address inefficient data movements and significantly reduces the memory bandwidth waste. Besides, it puts forward a multi-row stationary runahead execution model to maximize memory-level parallelism and overall throughput. The architecture of GROW is exhibited in Fig.13.
Fig.12 Overview of GCOD [29]

Full size|PPT slide

Fig.13 Overview of GROW [32]

Full size|PPT slide

SGCN [33] exploits compressed-sparse features existing in input feature to compress data which is suitable for GCN.The work optimizes compressing model not only in aggregate phase most work focus on but also in combine phase to reduce off-chip memory traffic. To better tackle data locality in the existence of the varying sparsity, SGCN proposes sparsity-aware cooperation. The basic view of SGCN is displayed in Fig.14. The work mainly pays attention to GCNs with deeper, more residual layers, which appears a huger potential from the intermediate feature sparsity.
Fig.14 Overview of SGCN [33]

Full size|PPT slide

4.2 Optimizations on computation

The accelerating computation in GNN is focused on scheduling scheme, dataflow, graph tiling, graph partitioning and graph reordering. GraphACT [52], AWB-GCN [26], GCNAX [28], EnGN [54], and GRIP [60] are typical works for these.
GraphACT proposes a GNN training system implemented by a host CPU plus an FPGA gas pedal. The host part, in addition to performing results such as the softmax function and the calculation of the loss function during training, also preprocesses the GNN task to be computed by an elaborate matching algorithm, arranges the data to be computed in a way that is most conducive to reuse, and transfers them to the memory on the FPGA board; the FPGA part contains two major hardware function modules, which perform vector computation and matrix computation, as well as the corresponding on-chip cache. Since the work is used in a training scenario, the preprocessing on the host is able to be performed in parallel with the task computation on the FPGA gas pedal, so it is more efficient. Fig.15 shows the timing flow diagram of the system for the GNN training task.
Fig.15 Per-minibatch scheduling between CPU and FPGA (up), and between FPGA computation modules (down) [52]

Full size|PPT slide

AWB-GCN focuses on the problem of load imbalance in sparse matrix multiplication computation, proposing three techniques of distributed smoothing, remote swapping, and special row remapping for load balancing, and designing the corresponding data path support in hardware, so as to improve the utilization of PE units with a unified structure. Since the sparse matrix of the sparse matrix multiplication computation in GNN is determined by the task, and the sparsity of each task is different, this paper also proposes an auto-tuning module to adjust the deployment strategy of the next cycle by the current cycle execution, so that the hardware utilization can be improved by auto-tuning when the GNN algorithm is run several times on the same data set.
GCNAX proposes a flexible and optimized dataflow for GCNs that simultaneously improves resource utilization and reduces data movement. This is realized by fully exploring the design space of GCN dataflows and evaluating the number of execution cycles and DRAM accesses through an analysis framework. As is shown in Fig.16, GCNAX considers the operations in a GNN layer with opposite order, to further research the dataflow and execution mode of the GNN.
Fig.16 The number of operations for the five datasets (first layer) using the two execution orders [28]

Full size|PPT slide

FlexGraph [21] makes a classification of GNN according to neighborhood definitions and aggregation patterns, indicating training GNN models’ key expressivity and performance challenges of GNN frameworks. The work uses DNFA (direct neighbors with flat aggregation), INFA (indirect neighbors with flat aggregation) and INHA (indirect neighbors with hierarchical aggregation) to define GNN model based on both static and operational characteristics. What’s more, it has a programming abstraction model abstraction NAU (NeighborSelection, Aggregation and Update) which support the above categories. Then it leverages a mixed execution scheme for hierarchical aggregation, as well as an application-driven workload balancing strategy and a pipeline processing strategy for efficient distributed training, which makes an optimization on GNN computation. The entire process is in Fig.17.
Fig.17 FlexGraph architecture [21]

Full size|PPT slide

To accelerate backward propagation’s aggregate computation in GCN training, Xu et al. [35] proposes an execution path preparing approach to reduce computations in training after discovering that GCN training can be equivalently changed to a partially-active graph processing procedure from graph processing’s angle. As is shown in Fig.18. What’s more, it proposes a structure-aware way on computing the optimal group sizes for the execution paths after studying the effective group sizes on performing backward aggregation in different execution path. At the same time, it compares its work with GNNAdvisor in GCN trainings, and it make a better performance.
Fig.18 Execution paths of backward aggregations in two layers on the example graph [35]

Full size|PPT slide

GRIP emphasizes the concept of Nodeflow: in the inference task of GNN, it is usually not necessary to calculate the output of all nodes, but only a few nodes, and from a few output nodes through the connection relations of the graph to invert the input nodes and their connection relations of each layer of GNN, which constitutes a sub-graph of minimum computation. GRIP designs corresponding multiple caches and corresponding prefetching modules to process Nodeflow data, which are sent to multiple units performing aggregation tasks of multiple nodes in parallel by means of cross-array; the node feature transformation part is accelerated by weighted chunking caches, and the overall architecture is shown in Fig.19. This work can support multiple GNN variants by combining the data flow between each intermediate cache, and the spatial reuse of data is achieved as much as possible through multiple caches and computational units, although the cross-array is not optimized enough and temporal reuse of data is less considered.
Fig.19 Overview of GRIP [60]

Full size|PPT slide

To accelerate quantized graph neural networks via GPU Tensor Core. QGTC [24] imports a novel quantized low-bit arithmetic design based on the low-bit data representation and bit-decomposed computation, which can support QGNN with diverse precision demands. It also leverages subgraph partitioning and batching, and zero-tile jumping to optimize computation, built on top of the GPU Tensor Core. The main process in shown in Fig.20. We can see that QGTC make the implementations on software, hardware, GPU kernel, framework and algorithm level.
Fig.20 The main framework of QGTC [24]

Full size|PPT slide

Another quantization work on GNN is Degree-Quant [22]. By quantization-aware training for GNNs, it designs a method according to degree’s characteristics and makes some optimizations. The work discovers and explains the sources of accuracy degradation in GNNs when using lower precision arithmetic. As is shown in Fig.21. Then it analyses how it chooses straight-thought estimator (STE) implementation, node degree and method for tracking quantization statistics during training impacts performance, which is comprehensive in theory and gain good performance in INT8 and INT4 data precision computations.
Fig.21 High-level view of the stochastic element of Degree-Quant. Masked (high in-degree) nodes, in green, operate at full precision, while unmasked nodes (red) operate at reduced precision. High in-degree nodes contribute most to poor gradient estimates, hence they are stochastically masked more often [22]

Full size|PPT slide

SGQuant [23] is a typical work of quantization optimizing on memory. The work investigates the multi-granularity quantization strategy that operates at different levels (components, graph topology, and layers) of GNN computation. Moreover, we offer an automatic bit-selecting (ABS) to pinpoint the most appropriate quantization bits for the above multi-granularity quantizations. As is shown in Fig.22. And it proposes a multi-granularity quantization featured with component-wise, topology-aware, and layer-wise quantization to meet the diverse data precision demands. What’s more, an end-to-end bits selecting in an automatic manner is introduced that makes the most appropriate choice for the aforementioned different quantization granularities.
Fig.22 Multi-Granularity Quantization: (a) Component-wise, (b) Topology-aware, (c) Layer-wise, and (d) Uniform Quantization. NOTE: the same color represents the same quantization bit [23]

Full size|PPT slide

EnGN is a typical representative of unified architecture, and this work proposes a NeuralGraph Processing Unit (NGPU) to perform sparse matrix or dense matrix multiplication computation in GNN. Specifically, NGPU consists of a two-dimensional array of interconnected Processing Element (PE), each PE contains a multiplicative accumulation computation unit, a degree aware feature cache, and a weight cache, which can better exploit data time reuse property. In addition, each column of PEs is connected by a topology called Ring-Edge-Reduce (RER), and it is through this data path that EnGN performs sparse matrix operations, and with the preprocessing of graph data, aggregates feature data to be aggregated through several cycles of data transfer. The architectural diagram of this work is shown in Fig.9. Its advantages are that the degree aware cache reduces the number of accesses and 2D array structure has better circuit characteristics, while its shortcomings are that the aggregation operation of GNN is not well supported and there is an additional overhead for aggregation in the RER structure.
FlowGNN [65] considers the real-time GNN inference, which cannot ensure the workload’s amount, i.e., agnostic workload. The fundamental architecture is depicted in Fig.23. The improved architecture can simultaneously process multiple nodes and edges, enabled by an NT-to-MP adapter via on-the-fly multicasting. FlowGNN analyzes dataflow in GNN model, indicating limitations existing in accelerating architecture. Consequently, it proposes an explicit message passing mechanism that improves the generality of GNN accelerator dramatically and multi-level parallelism (inter-node, intra-node, inter-edge across both message passing and node transformation) via multi-queue-based dataflow that can notably enhance effect without losing generality.
Fig.23 Overview of FlowGNN [65]

Full size|PPT slide

4.3 Optimizations on emerging technologies

The process in memory (PIM) architecture is booming in memory-intensive situation, ReRAM as a representation, is mainly composed of wordline, bitline, DAC, ADC, etc., which can be used in GNN’s aggregate acceleration. As is shown in Fig.24, (a) is the basic construction of ReRAM, and (b) is the work flow of it. This special architecture can be used to accelerate matrix-vector multiplications (MVM) efficiently. The weight matrix is generally put on resistance crossbars, and input data passes through wordline becoming voltage by DAC, then bitline accumulates the current from the product of voltage and electrical conductance, passing the ADC to return to result data. At present, the typical PIM works on GNN are Huang et al. [27], ReGNN [62] and PASGCN [34,50]. Huang et al. [27] proposes a new PIM accelerator called REFLIP. As is shown in Fig.25. First, REFLIP adopts ReRAM crossbar architectures to build a unified architecture for supporting the two GNN phases. Second, REFLIP adopts the flipped-mapping scheme that can reduce the sparsity of adjacent matrix and improve the massive crossbar-structured parallelism, which exchanges the position of weight and input. The scheme is shown in Fig.26.
Fig.24 The ReRAM architecture [28]

Full size|PPT slide

Fig.25 The REFLIP architecture [27]

Full size|PPT slide

Fig.26 Comparing and contrasting the classic graph mapping and our flipped mapping for GCNs. (a) A sample graph, (b) traditional graph processing mapping scheme that maps sparse edge data into crossbar arrays, and (c) the flipped-mapping scheme in REFLIP that maps multi-dimensional vertex features into crossbar arrays. The notation ei,j represents an edge pointing from the vertex i to the vertex j [27]

Full size|PPT slide

ReGNN [62] is composed of analog PIM (APIM) and digital PIM (DPIM) modules for accelerating MVM operations and non-MVM aggregate operations respectively. ReGNN maps data to aggregation sub-engines based on the degree of vertices and the dimension of feature vectors to increase data parallelism, the architecture of ReGNN is shown in Fig.27.
Fig.27 Overview of ReGNN [62]

Full size|PPT slide

Latest works on GNN is PASGCN [34,50], This work fully utilizes parallelism in vertex-intra and vertex-inter by employing dense data mapping and a search-execute architecture with acceptable crossbars cost, as is shown in Fig.28. And there are also two scheduling strategies for maximizing the vertex-inter parallelisms. Optimal scheduling is reduced to a maximum independent set problem, which is solved by an algorithm called node-grouping. Besides, this work explores the task-irrelevant information in graphs and proposes an adaptively sparsified GCN network, which is named as ASparGCN. ASparGCN exploits a multilayer perception (MLP)-based edge predictor to get edge selection strategies for each GCN layer separately and adaptively in training stage, and only inferences with the selected edges in test stage. The whole architecture is shown in Fig.29.
Fig.28 GCN design with the two design patterns. (a) Adjacent matrix; (b) GCN design with general MM (GEMM) crossbars; (c) GCN design with CAM crossbars and MAC crossbars [34]

Full size|PPT slide

Fig.29 Overview of PASGCN [34]

Full size|PPT slide

GraNDe [66] uses near-data processing architecture to map matrix adaptively for GCN. The architecture of it is displayed in Fig.30. The aggregate phase is separated from general processing units by NDP modules, which are located on the DIMM’s buffer chip, one for each rank. The paper locates processing elements (NDP modules) near the DRAM data path to exploit rank-level parallelism, which is at a buffer device located in the middle of the DRAM datapath, multiple ranks can concurrently transfer data to the processing units. Besides, by exploring the data mapping of the operand matrices to DRAM ranks, it is discovered that the optimal mapping effect differs depending on the specific GCN layer’s configuration information.
Fig.30 Overview of GraNDe [66]

Full size|PPT slide

5 Future directions and reviews

On the basis of considerable surveys on GNN accelerators, we learn the characteristics of GNN and challenges when designing GNN accelerators, and related works to address the problems. We make the review of these works. In non-PIM architecture, most works are done on adjacent matrix, which is sparse and following power-low distribution, such as enhancing locality, reducing data redundancy, etc. What’s more, these works also think of imbalance of workload problems on their hardware design or through scheduling schemes to improve the percentage of resource utilization. Besides, there are also works on optimizations on dataflow (nodes flow). In PIM architecture, computation efficiency is optimized by its self construction, so the core works are on parallelism and balanced resource utilization. To be exact, main works are on aggregate phase, which is the performance and energy consumption bottleneck of GNN.
From our perspective, there are some aspects we think being done later.
● Previous hardware accelerating works mainly accelerate only one of the two phase in GNN, especially aggregate phase, which is key bottleneck of GNN. We should build a holistic view to design a GNN hardware accelerators supporting all phases, apart from between the two phases. The sentence “Within these stages” includes optimizations on aggregation, combination and both themselves. The sentence “Between these stages” means optimizations on the dataflow, communications and buffer mechanisms, in other words, between aggregation and combination.
● Data compression [22-24]. Compressing the data can bring notable performance improvement. And we can explore the data precision and model prunings on GNN’s effect.
● Hardware accelerating method of traditional platforms. Although most works are on CPU [31] or GPU [64] these traditional platforms, there are still only a few works on hardware aspects, we can still explore the balanced workload, performance-resource model [56,58], and substitute elements compensating for the drawback of traditional algorithm and software platforms.
● Further optimizations on adjacent matrix. We can explore the adjacent matrix locality further using more accurate method to keep the useful information.
● Explorations on PIM. ReRAM is a successful work on GNN platform. Therefore, we can explore more works on PIM architecture specially for GNN accelerations.
● Dataset characteristics analysis. From previous works, there are some works [25,29] on dataset processing. But there still have more characteristics to be excavated in GNN datasets, for example, real social network datasets following power-law distribution or other datasets which should be further searched for And we think we can optimize GNN on various realistic dataset.

6 Conclusions

This paper first introduces the basics of graph neural network applications, common algorithms, application scenarios, programming models, and mainstream frameworks and extension libraries based on common platforms. Then it presents a detailed and systematic analysis and introduction of the key optimization techniques in this field from the overall structure design, computation, memory, and emerging technologies, taking the accelerated structure design challenges brought by the execution behavior of graph neural networks as starting point. To face the challenges, we introduced the related works to handle these issues. What's more, this paper presents the key optimization techniques and future directions in the design of accelerated structures for GNN, and provides a clear understanding of current state of research in this field and inspires researchers in the design of accelerated structures.

Jingyu Liu received the master degree in integrated circuit engineering from National University of Defense Technology, China in 2021. He is now working towards the PhD degree with the School of Computer, National University of Defense Technology, China. His research interests include computer architecture and graph-based hardware accelerator

Shi Chen received the bachelor degree in Computer Science & Technology from National University of Defense Technology, China in 2021. He is now working towards the PhD degree with the School of Computer, National University of Defense Technology, China. His research interests include computer architecture and graph-based hardware accelerator

Li Shen received the BS and PhD degrees in Computer Science & Technology from National University of Defense Technology, China. Currently he is a professor at School of Computer, National University of Defense Technology, China. His research interests include high performance processor architecture, parallel programming, and performance optimization techniques

References

[1]
Cao S, Lu W, Xu Q. Deep neural networks for learning graph representations. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2016, 1145−1152
[2]
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. In: Proceedings of the 6th International Conference on Learning Representations. 2018
[3]
You J, Ying R, Ren X, Hamilton W L, Leskovec J. GraphRNN: a deep generative model for graphs. 2018, arXiv preprint arXiv: 1802.08773
[4]
Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? In: Proceedings of the 7th International Conference on Learning Representations. 2019
[5]
Wu Z, Pan S, Chen F, Long G, Zhang C, Yu P S . A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32( 1): 4–24
[6]
Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 1025−1035
[7]
Ying R, He R, Chen K, Eksombatchai P, Hamilton W L, Leskovec J. Graph convolutional neural networks for web-scale recommender systems. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 974−983
[8]
Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations. 2017
[9]
Gao H, Wang Z, Ji S. Large-scale learnable graph convolutional networks. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, 1416−1424
[10]
Li R, Wang S, Zhu F, Huang J. Adaptive graph convolutional neural networks. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI 18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI 18). 2018, 434
[11]
Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? 2018, arXiv preprint arXiv: 1810.00826
[12]
Zhang M, Cui Z, Neumann M, Chen Y. An end-to-end deep learning architecture for graph classification. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18). 2018, 4438−4445
[13]
Yin L, Wang J, Zheng H. Exploring architecture, dataflow, and sparsity for gcn accelerators: a holistic framework. In: Proceedings of the Great Lakes Symposium on VLSI 2023. 2023, 489–495
[14]
Garg R, Qin E, Munoz-Matrinez F, Guirado R, Jain A, Abadal S, Abellan J L, Acacio M E, Alarcon E, Rajamanickam S, Krishna T. Understanding the design space of sparse/dense multiphase gnn dataflows on spatial accelerators. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium. 2022, 571–582
[15]
Hamaguchi T, Oiwa H, Shimbo M, Matsumoto Y. Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017, 1802−1808
[16]
Schlichtkrull M, Kipf T N, Bloem P, van den Berg R, Titov I, Welling M. Modeling relational data with graph convolutional networks. In: Proceedings of the 15th International Conference on the Semantic Web. 2018, 593−607
[17]
Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M . Graph neural networks: a review of methods and applications. AI Open, 2020, 1: 57–81
[18]
Ma L, Yang Z, Miao Y, Xue J, Wu M, Zhou L, Dai Y. NeuGraph: parallel deep neural network computation on large graphs. In: Proceedings of 2019 USENIX Annual Technical Conference. 2019, 443−458
[19]
Yan M, Chen Z, Deng L, Ye X, Zhang Z, Fan D, Xie Y . Characterizing and understanding GCNs on GPU. IEEE Computer Architecture Letters, 2020, 19( 1): 22–25
[20]
Yang J, Tang D, Song X, Wang L, Yin Q, Chen R, Yu W, Zhou J. GNNLab: a factored system for sample based GNN training over GPUs. In: Proceedings of the 17th European Conference on Computer Systems. 2022, 417−434
[21]
Wang L, Yin Q, Tian C, Yang J, Chen R, Yu W, Yao Z, Zhou J. FlexGraph: a flexible and efficient distributed framework for GNN training. In: Proceedings of the 16th European Conference on Computer Systems. 2021, 67−82
[22]
Tailor S A, Fernández-Marqués J, Lane N D. Degree-quant: quantization-aware training for graph neural networks. In: Proceedings of the 9th International Conference on Learning Representations. 2021
[23]
Feng B, Wang Y, Li X, Yang S, Peng X, Ding Y. SGQuant: squeezing the last bit on graph neural networks with specialized quantization. In: Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence. 2020, 1044−1052
[24]
Wang Y, Feng B, Ding Y. QGTC: accelerating quantized graph neural networks via GPU tensor core. In: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2022, 107−119
[25]
Wang Y, Feng B, Li G, Li S, Deng L, Xie Y, Ding Y. GNNAdvisor: an adaptive and efficient runtime system for GNN acceleration on GPUs. In: Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation. 2021, 515−531
[26]
Geng T, Li A, Shi R, Wu C, Wang T, Li Y, Haghi P, Tumeo A, Che S, Reinhardt S, Herbordt M C. AWB-GCN: a graph convolutional network accelerator with runtime workload rebalancing. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 2020, 922−936
[27]
Huang Y, Zheng L, Yao P, Wang Q, Liao X, Jin H, Xue J. Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2022, 1029−1042
[28]
Li J, Louri A, Karanth A, Bunescu R C. GCNAX: a flexible and energy-efficient accelerator for graph convolutional neural networks. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2021, 775−788
[29]
You H, Geng T, Zhang Y, Li A, Lin Y. GCoD: graph convolutional network acceleration via dedicated algorithm and accelerator Co-design. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2022, 460−474
[30]
Geng T, Wu C, Zhang Y, Tan C, Xie C, You H, Herbordt M, Lin Y, Li A. I-GCN: a graph convolutional network accelerator with runtime locality enhancement through islandization. In: Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture. 2021, 1051−1063
[31]
Gong Z, Ji H, Yao Y, Fletcher C W, Hughes C J, Torrellas J. Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques. In: Proceedings of the 49th Annual International Symposium on Computer Architecture. 2022, 916−931
[32]
Hwang R, Kang M, Lee J, Kam D, Lee Y, Rhu M. GROW: a row-stationary sparse-dense GEMM accelerator for memory-efficient graph convolutional neural networks. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2023, 42−55
[33]
Yoo M, Song J, Lee J, Kim N, Kim Y, Lee J. SGCN: exploiting compressed-sparse features in deep graph convolutional network accelerators. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2023, 1−14
[34]
Yang T, Li D, Ma F, Song Z, Zhao Y, Zhang J, Liu F, Jiang L . PASGCN: an ReRAM-based PIM design for GCN with adaptively sparsified graphs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42( 1): 150–163
[35]
Xu S, Shao Z, Yang C, Liao X, Jin H . Accelerating backward aggregation in GCN training with execution path preparing on GPUs. IEEE Transactions on Parallel and Distributed Systems, 2022, 33( 12): 4891–4902
[36]
Gui C, Zheng L, He B, Liu C, Chen X, Liao X, Jin H . A survey on graph processing accelerators: challenges and opportunities. Journal of Computer Science and Technology, 2019, 34( 2): 339–371
[37]
Roy A, Mihailovic I, Zwaenepoel W. X-stream: edge-centric graph processing using streaming partitions. In: Proceedings of the 24th Symposium on Operating Systems Principles. 2013, 472−488
[38]
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014, 701−710
[39]
Ozdal M M, Yesil S, Kim T, Ayupov A, Greth J, Burns S, Ozturk O. Energy efficient architecture for graph analytics accelerators. In: Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture. 2016, 166−177
[40]
Zhang M, Zhuo Y, Wang C, Gao M, Wu Y, Chen K, Kozyrakis C, Qian X. GraphP: reducing communication for PIM-based graph processing with efficient data partition. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture. 2018, 544−557
[41]
Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture. 2018, 531−543
[42]
Xie C, Yan L, Li W J, Zhang Z. Distributed power-law graph computing: theoretical and empirical analysis. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014, 1673−1681
[43]
Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. 2012, 17−30
[44]
Heidari S, Simmhan Y, Calheiros R N, Buyya R . Scalable graph processing frameworks: a taxonomy and open challenges. ACM Computing Surveys, 2019, 51( 3): 60
[45]
Ham T J, Wu L, Sundaram N, Satish N, Martonosi M. Graphicionado: a high-performance and energy-efficient accelerator for graph analytics. In: Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture. 2016, 1−13
[46]
Yan M, Hu X, Li S, Basak A, Li H, Ma X, Akgun I, Feng Y, Gu P, Deng L, Ye X, Zhang Z, Fan D, Xie Y. Alleviating irregularity in graph analytics acceleration: a hardware/software Co-design approach. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 2019, 615−628
[47]
Zhang S, Qin Z, Yang Y, Shen L, Wang Z . Transparent partial page migration between CPU and GPU. Frontiers of Computer Science, 2020, 14( 3): 143101
[48]
Fan S, Fei J, Shen L . Accelerating deep learning with a parallel mechanism using CPU + MIC. International Journal of Parallel Programming, 2018, 46( 4): 660–673
[49]
Yan M, Deng L, Hu X, Liang L, Feng Y, Ye X, Zhang Z, Fan D, Xie Y. HyGCN: a GCN accelerator with hybrid architecture. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture. 2020, 15−29
[50]
Yang T, Li D, Han Y, Zhao Y, Liu F, Liang X, He Z, Jiang L. PIMGCN: a ReRAM-based PIM design for graph convolutional network acceleration. In: Proceedings of the 58th ACM/IEEE Design Automation Conference. 2021, 583−588
[51]
Yang H. AliGraph: a comprehensive graph neural network platform. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, 3165−3166
[52]
Zeng H, Prasanna V K. GraphACT: accelerating GCN training on CPU-FPGA heterogeneous platforms. In: Proceedings of 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020, 255−265
[53]
Kung H T. Why systolic architectures? Computer, 1982, 15(1): 37−46
[54]
Liang S, Wang Y, Liu C, He L, Li H, Xu D, Li X . EnGN: a high-throughput and energy-efficient accelerator for large graph neural networks. IEEE Transactions on Computers, 2021, 70( 9): 1511–1525
[55]
Auten A, Tomei M, Kumar R. Hardware acceleration of graph neural networks. In: Proceedings of the 57th ACM/IEEE Design Automation Conference. 2020, 1−6
[56]
Liang S, Liu C, Wang Y, Li H, Li X. DeepBurning-GL: an automated framework for generating graph neural network accelerators. In: Proceedings of IEEE/ACM International Conference on Computer Aided Design. 2020, 72
[57]
Song X, Zhi T, Fan Z, Zhang Z, Zeng X, Li W, Hu X, Du Z, Guo Q, Chen Y . Cambricon-G: a polyvalent energy-efficient accelerator for dynamic graph neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022, 41( 1): 116–128
[58]
Zhou Z, Shi B, Zhang Z, Guan Y, Sun G, Luo G. BlockGNN: towards efficient GNN acceleration using block-circulant weight matrices. In: Proceedings of the 58th ACM/IEEE Design Automation Conference. 2021, 1009−1014
[59]
Arka A I, Doppa J R, Pande P P, Joardar B K, Chakrabarty K. RegraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In: Proceedings of Design, Automation & Test in Europe Conference & Exhibition. 2021, 1667−1672
[60]
Kiningham K, Levis P, Ré C . GRIP: a graph neural network accelerator architecture. IEEE Transactions on Computers, 2023, 72( 4): 914–925
[61]
Chen C, Li K, Li Y, Zou X. ReGNN: a redundancy-eliminated graph neural networks accelerator. In: Proceedings of IEEE International Symposium on High-Performance Computer Architecture. 2022, 429−443
[62]
Liu C, Liu H, Jin H, Liao X, Zhang Y, Duan Z, Xu J, Li H. ReGNN: a ReRAM-based heterogeneous architecture for general graph neural networks. In: Proceedings of the 59th ACM/IEEE Design Automation Conference. 2022, 469−474
[63]
Lee Y, Chung J, Rhu M. SmartsAGE: training large-scale graph neural networks using in-storage processing architectures. In: Proceedings of the 49th Annual International Symposium on Computer Architecture. 2022, 932−945
[64]
Sun Q, Liu Y, Yang H, Zhang R, Dun M, Li M, Liu X, Xiao W, Li Y, Luan Z, Qian D. CoGNN: efficient scheduling for concurrent GNN training on GPUs. In: Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis. 2022, 39
[65]
Sarkar R, Abi-Karam S, He Y, Sathidevi L, Hao C. FlowGNN: a dataflow architecture for real-time workload-agnostic graph neural network inference. In: Proceedings of the 29th IEEE International Symposium on High-Performance Computer Architecture. 2023, 1099−1112
[66]
Yun S, Kim B, Park J, Nam H, Ahn J H, Lee E . GraNDe: near-data processing architecture with adaptive matrix mapping for graph convolutional networks. IEEE Computer Architecture Letters, 2022, 21( 2): 45–48
[67]
Gustavson F G . Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Transactions on Mathematical Software, 1978, 4( 3): 250–269

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62032001 and 61972407) .

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press
AI Summary AI Mindmap
PDF(16089 KB)

1203

Accesses

0

Citations

Detail

Sections
Recommended

/