【阅读】A Comprehensive Survey on Distributed Training of Graph Neural Networks——翻译

转载请注明出处:小锋学长生活大爆炸[xfxuezhang.cn]

 (本文中,涉及到公式部分的翻译不准确,请看对应原文。)

另一篇:【阅读】Distributed Graph Neural Network Training: A Survey

目录

摘要

关键词

I. 介绍

II. 背景

A. 图

B. 图神经网络

C. 图神经网络的训练方法

III. 分布式GNN训练的分类

A. 分布式全批量训练

B. 分布式小批量训练

C. 分布式全批量与分布式小批量的对比

D. 其他分类信息

IV. 分布式全批量训练

A. 基于调度工作负载的执行

B. 基于预设工作负载的执行

V. 分布式小批量训练

A. 基于单独样本的执行

B. 基于联合样本的执行

VI. 分布式GNN训练的软件框架

VII. 分布式GNN训练硬件平台

A. 多CPU硬件平台

B. 多GPU硬件平台

C. 多CPU硬件平台 vs. 多GPU硬件平台

VIII. 与分布式DNN训练的比较

 A. 分布式DNN训练简介

B. 分布式全批量训练 vs. DNN模型并行

C. 分布式小批量训练 vs. DNN数据并行

IX. 总结与讨论

A. 性能瓶颈的定量分析

B. 性能基准测试

C. 极限规模硬件平台上的分布式训练

D. 分布式GNN训练通用通信库

X. 总结

ACKNOWLEDGMENTS


Abstract

        Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.

        图形神经网络(GNN)因其在图学习中的有效性,在广泛的应用领域中被证明是一种强大的算法模型。为了将GNN训练扩展到大规模且不断增长的图,最有希望的解决方案是分布式训练,它将训练的工作量分布在多个计算节点上。然而,人们对分布式GNN训练的工作流程、计算模式、通信模式和优化技术仍然只有初步的了解。在本文中,我们通过研究分布式GNN训练中使用的各种优化技术,对分布式GNN训练进行了全面的综述。首先,根据其工作流程将分布式GNN训练分为几个类别。此外,还介绍了它们的计算模式和通信模式,以及最近工作提出的优化技术。其次,还介绍了分布式GNN训练的软件框架和硬件平台,以加深理解。第三,将分布式GNN训练与深度神经网络的分布式训练进行了比较,强调了分布式GNN训练的独特性。最后,讨论了该领域的有趣问题和机遇。

Index Terms

        Graph learning, graph neural network, distributed training, workflow, computational pattern, communication pattern, optimization technique, software framework.

        图形学习、图形神经网络、分布式训练、工作流、计算模式、通信模式、优化技术、软件框架。

INTRODUCTION

        GRAPH is a well-known data structure widely used in many critical application fields due to its powerful representation capability of data, especially in expressing the associations between objects [1], [2]. Many real-world data can be naturally represented as graphs which consist of a set of vertices and edges. Take social networks as an example [3], [4], the vertices in the graph represent people and the edges represent interactions between people on Facebook [5]. An illustration of graphs for social networks is illustrated in Fig. 1 (a), where the circles represent the vertices, and the arrows represent the edges. Another well-known example is knowledge graphs [6], [7], in which the vertices represent entities while the edges represent relations between the entities [8].

        图(Graph)是一种众所周知的数据结构,由于其强大的数据表示能力,特别是在表达对象之间的关联方面[1]、[2],被广泛应用于许多关键应用领域。许多真实世界的数据可以自然地表示为由一组顶点和边组成的图。以社交网络为例[3]、[4],图中的顶点表示人,边表示Facebook上人与人之间的交互[5]。图1(a)给出了社交网络的图示,其中圆圈表示顶点,箭头表示边。另一个众所周知的例子是知识图谱[6]、[7],其中顶点表示实体,而边表示实体之间的关系[8]。

        Graph neural networks (GNNs) demonstrate superior performance compared to other algorithmic models in learning over graphs [9]–[11]. Deep neural networks (DNNs) have been widely used to analyze Euclidean data such as images [12]. However, they have been challenged by graph data from the non-Euclidean domain due to the arbitrary size and complex topological structure of graphs [13]. Besides, a major weakness of deep learning paradigms identified by industry is that they cannot effectively carry out causal reasoning, which greatly reduces the cognitive ability of intelligent systems [14]. To this end, GNNs have emerged as the premier paradigm for graph learning and endowed intelligent systems with cognitive ability. An illustration for GNNs is shown in Fig. 1 (b). After getting graph data as input, GNNs use forward propagation and backward propagation to update the model parameters. Then the trained model can be applied to graph tasks, including vertex prediction [15] (predicting the properties of specific vertices), link prediction [16] (predicting the existence of an edge between two vertices), and graph prediction [17] (predicting the properties of the whole graph), as shown in Fig. 1 (c).

        与其他算法模型相比,图形神经网络(GNN)在图形学习方面表现出了优异的性能[9]-[11]。深度神经网络(DNN)已广泛用于分析图像等欧几里得数据[12]。然而,由于图的任意大小和复杂拓扑结构,它们受到了来自非欧几里得域的图数据的挑战[13]。此外,工业界发现的深度学习范式的一个主要弱点是它们不能有效地进行因果推理,这大大降低了智能系统的认知能力[14]。为此,GNN已成为图形学习的首要范式,并赋予智能系统认知能力。GNN的图示如图1(b)所示。在获得图形数据作为输入后,GNN使用正向传播和反向传播来更新模型参数。然后,训练后的模型可以应用于图形任务,包括顶点预测[15](预测特定顶点的属性)、链接预测[16](预测两个顶点之间的边的存在)和图形预测[17](预测整个图形的属性),如图1(c)所示。

         Thanks to the superiority of GNNs, they have been widely used in various real-world applications in many critical fields. These real-world applications include knowledge inference [18], natural language processing [19], [20], machine translation [21], recommendation systems [22]–[24], visual reasoning [25], chip design [26]–[28], traffic prediction [29]– [31], ride-hailing demand forecasting [32], spam review detection [33], molecule property prediction [34], and so forth. GNNs enhance the machine intelligence when processing a broad range of real-world applications, such as giving >50% accuracy improvement for real-time ETAs in Google Maps [29], generating >40% higher-quality recommendations in Pinterest [22], achieving >10% improvement of ride-hailing demand forecasting in Didi [32], improving >66.90% of recall at 90% precision for spam review detection in Alibaba [33].

        由于GNN的优越性,它们已广泛应用于许多关键领域的各种现实应用中。这些实际应用包括知识推理[18]、自然语言处理[19]、[20]、机器翻译[21]、推荐系统[22]–[24]、视觉推理[25]、芯片设计[26]–[28]、交通预测[29]–[31]、网约车需求预测[32]、垃圾评论检测[33]、分子属性预测[34]等。GNN在处理广泛的现实世界应用时增强了机器智能,例如,在谷歌地图中将实时ETA的准确度提高了50%以上[29],在Pinterest中使推荐质量提高了40%以上[22],在滴滴中将网约车需求预测效果提高了10%以上[32],在阿里巴巴的垃圾评论检测中于90%精确率下将召回率提高了66.90%以上[33]。

        However, both industry and academia are still eagerly expecting the acceleration of GNN training for the following reasons [35]–[38]:

        然而,工业界和学术界仍热切期待GNN训练的加速,原因如下[35]–[38]:

        1. The scale of graph data is rapidly expanding, consuming a great deal of time for GNN training. With the explosion of information on the Internet, new graph data are constantly being generated and changed, such as the establishment and demise of interpersonal relationships in social communication and the changes in people's preferences for goods in online shopping. The scales of vertices and edges in graphs are approaching or even outnumbering the order of billions and trillions, respectively [39]–[42]. The growth rate of graph scales is also astonishing. For example, the number of vertices (i.e., users) in Facebook's social network is growing at a rate of 17% per year [43]. Consequently, GNN training time dramatically increases due to the ever-growing scale of graph data.

        1. 图数据的规模正在迅速扩大,使GNN训练消耗大量时间。随着互联网上信息的爆炸,新的图数据不断产生和变化,例如社会交往中人际关系的建立和消亡,以及人们在网上购物中对商品偏好的变化。图中顶点和边的规模正分别接近甚至超过十亿(billion)和万亿(trillion)量级[39]–[42]。图规模的增长速度也很惊人。例如,Facebook社交网络中的顶点(即用户)数量以每年17%的速度增长[43]。因此,由于图数据的规模不断增长,GNN训练时间显著增加。

        2. Swift development and deployment of novel GNN models involves repeated training, in which a large amount of training time is inevitable. Much experimental work is required to develop a highly-accurate GNN model since repeated training is needed [9]–[11]. Moreover, expanding the usage of GNN models to new application fields also requires much time to train the model with real-life data. Such a sizeable computational burden calls for faster methods of training.

        2. 快速开发和部署新型GNN模型涉及重复训练,其中大量的训练时间是不可避免的。由于需要重复训练[9]-[11],因此开发高精度GNN模型需要大量实验工作。此外,将GNN模型的使用扩展到新的应用领域也需要大量时间来使用真实数据训练模型。如此庞大的计算负担需要更快的训练方法。

        Distributed training is a popular solution to speed up GNN training [35]–[38], [40], [44]–[58]. It tries to accelerate the entire computing process by adding more computing resources, or "nodes", to the computing system with parallel execution strategies, as shown in Fig. 1 (d). NeuGraph [44], proposed in 2019, is the first published work of distributed GNN training. Since then, there has been a steady stream of attempts to improve the efficiency of distributed GNN training in recent years with significantly varied optimization techniques, including workload partitioning [44]–[47], transmission planning [37], [44]–[46], caching strategy [35], [51], [52], etc.

        分布式训练是加快GNN训练的流行解决方案[35]–[38],[40],[44]–[58]。它试图通过采用并行执行策略、向计算系统添加更多计算资源(即“节点”)来加速整个计算过程,如图1(d)所示。2019年提出的NeuGraph[44]是分布式GNN训练的第一个已发表的工作。从那时起,近年来出现了大量尝试,通过各不相同的优化技术来提高分布式GNN训练的效率,包括工作负载划分[44]–[47]、传输规划[37]、[44]–[46]、缓存策略[35]、[51]、[52]等。

NeuGraph[44]是分布式GNN训练的第一个已发表的工作

        Despite the aforementioned efforts, there is still a dearth of review of distributed GNN training. The need for management and cooperation among multiple computing nodes leads to a different workflow, resulting in complex computational and communication patterns, and making it a challenge to optimize distributed GNN training. However, regardless of plenty of efforts on this aspect have been or are being made, there are hardly any surveys on these challenges and solutions. Current surveys mainly focus on GNN models and hardware accelerators [9]–[11], [59]–[62], but they are not intended to provide a careful taxonomy and a general overview of distributed training of GNNs, especially from the perspective of the workflows, computational patterns, communication patterns, and optimization techniques.

        尽管已有上述努力,但仍缺乏对分布式GNN训练的综述。对多个计算节点之间管理与协作的需求带来了不同的工作流,进而产生复杂的计算和通信模式,使优化分布式GNN训练成为一项挑战。然而,尽管在这方面已经或正在进行大量工作,但几乎没有任何针对这些挑战和解决方案的综述。当前的综述主要集中于GNN模型和硬件加速器[9]–[11],[59]–[62],但它们并非旨在提供关于GNN分布式训练的细致分类和总体概述,尤其是从工作流、计算模式、通信模式和优化技术的角度。

GNN模型和硬件加速器[9]–[11],[59]–[62]

        This paper presents a comprehensive review of distributed training of GNNs by investigating its various optimization techniques. First, we summarize distributed GNN training into several categories according to their workflows. In addition, we introduce their computational patterns, communication patterns, and various optimization techniques proposed in recent works to facilitate scholars to quickly understand the principle of recent optimization techniques and the current research status. Second, we introduce the prevalent software frameworks and hardware platforms for distributed GNN training and their respective characteristics. Third, we emphasize the uniqueness of distributed GNN training by comparing it with distributed DNN training. Finally, we discuss interesting issues and opportunities in this field.

        本文通过研究分布式GNN训练中使用的各种优化技术,对其进行了全面回顾。首先,我们根据工作流程将分布式GNN训练归纳为几个类别。此外,我们还介绍了它们的计算模式、通信模式以及最近工作中提出的各种优化技术,以帮助学者快速了解最新优化技术的原理和当前的研究现状。其次,我们介绍了用于分布式GNN训练的流行软件框架和硬件平台及其各自的特点。第三,通过与分布式DNN训练的比较,我们强调了分布式GNN训练的独特性。最后,我们讨论了这个领域的有趣问题和机遇。

        Our main goals are as follows:

        我们的主要目标如下:

        • Introducing the basic concepts of distributed GNN training.

        • 介绍分布式GNN训练的基本概念。

        • Analyzing the workflows, computational patterns, and communication patterns of distributed GNN training and summarizing the optimization techniques.

        • 分析分布式GNN训练的工作流程、计算模式和通信模式,总结优化技术。

        • Highlighting the differences between distributed GNN training and distributed DNN training.

        • 强调分布式GNN训练和分布式DNN训练之间的差异。

        • Discussing interesting issues and opportunities in the field of distributed GNN training.

        • 讨论分布式GNN训练领域的有趣问题和机会。

        The rest of this paper is described in accordance with these goals. Its organization is shown in Table I and summarized as follows:

        本文的其余部分将根据这些目标进行描述。其组织结构见表一,总结如下:

        • Section II introduces the basic concepts of graphs and GNNs as well as two training methods of GNNs: fullbatch training and mini-batch training.

        • 第二节介绍了图和GNN的基本概念,以及GNN的两种训练方法:全批量训练和小批量训练。

        • Section III introduces the taxonomy of distributed GNN training and makes a comparison between them.

        • 第三节介绍了分布式GNN训练的分类,并对其进行了比较。

        • Section IV introduces distributed full-batch training of GNNs and further categorizes it into two types, of which the workflows, computational patterns, communication patterns, and optimization techniques are introduced in detail.

        • 第四节介绍了GNN的分布式全批量训练,并将其进一步分类为两种类型,其中详细介绍了工作流、计算模式、通信模式和优化技术

        • Section V introduces distributed mini-batch training and classifies it into two types. We also present the workflow, computational pattern, communication pattern, and optimization techniques for each type.

        • 第五节介绍了分布式小批量训练,并将其分为两种类型。我们还介绍了每种类型的工作流、计算模式、通信模式和优化技术。

        • Section VI introduces software frameworks currently supporting distributed GNN training and describes their characteristics.

        • 第六节介绍了目前支持分布式GNN训练的软件框架,并描述了其特点。

        • Section VII introduces hardware platforms for distributed GNN training.

        • 第七节介绍了分布式GNN训练的硬件平台。

        • Section VIII highlights the uniqueness of distributed GNN training by comparing it with distributed DNN training.

        • 第八节通过将分布式GNN训练与分布式DNN训练进行比较,强调了分布式GNN训练的独特性。

        • Section IX summarizes the distributed full-batch training and distributed mini-batch training, and discusses several interesting issues and opportunities in this field.

        • 第九节总结了分布式全批量训练和分布式小批量训练,并讨论了该领域的几个有趣问题和机会

        • Section X is the conclusion of this paper.

        • 第十节是本文的结论

 BACKGROUND

        This section provides some background concepts of graphs as well as GNNs, and introduces the two training methods applied in GNN training: full-batch and mini-batch.

        本节提供了图形和GNN的一些背景概念,并介绍了GNN训练中应用的两种训练方法:全批和小批。

A. Graphs

        A graph is a type of data structure consisting of vertices and edges. Its flexible structure can effectively express the relationship among a set of objects with arbitrary size and undetermined structure, which is called non-Euclidean data. As shown in Fig. 2 (a) and (b), the form of non-Euclidean data is not as highly structured as Euclidean data. However, by using vertices to represent the objects and edges to represent relationships among the objects, graphs can effectively represent non-Euclidean data, such as social networks and knowledge graphs.

        图是一种由顶点和边组成的数据结构。其灵活的结构可以有效地表达一组具有任意大小和不确定结构的对象之间的关系,称为非欧几里得数据。如图2(a)和(b)所示,非欧几里得数据的形式不像欧几里得数据那样高度结构化。然而,通过使用顶点来表示对象,使用边来表示对象之间的关系,图可以有效地表示非欧几里得数据,例如社交网络和知识图。

        There are mainly three taxonomies of graphs:

        图形主要有三种分类法:

        Directed/Undirected Graphs: Every edge in a directed graph has a fixed direction, indicating that the connection is only from a source vertex to a destination vertex. However, in an undirected graph, the connection represented by an edge is bi-directional between the two vertices. An undirected graph can be transformed into a directed one, in which two edges in the opposite direction represent an undirected edge in the original graph.

        有向/无向图:有向图中的每条边都有一个固定的方向,表示连接仅从源顶点到目标顶点。然而,在无向图中,由边表示的连接在两个顶点之间是双向的。一个无向图可以转化为一个有向图,其中两条相反方向的边表示原始图中的一条无向边。
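
        例如,下面一行示意性的Python代码(非原文内容,仅作说明)把一个无向边列表转换成有向边列表,每条无向边对应两条方向相反的有向边:

```python
undirected_edges = [(0, 1), (1, 2)]
directed_edges = [e for u, v in undirected_edges for e in ((u, v), (v, u))]
print(directed_edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```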

        Homogeneous/Heterogeneous Graphs: Homogeneous graphs contain a single type of vertex and edge, while heterogeneous graphs contain multiple types of vertices and multiple types of edges. Thus, heterogeneous graphs are more powerful in expressing the relationships between different objects.

        同构/异构图:同构图包含单一类型的顶点和边,而异构图包含多种类型的顶点以及多种类型的边。因此,异构图在表达不同对象之间的关系方面更为强大

        Static/Dynamic Graphs: The structure and the feature of static graphs are always unchanged, while those of dynamic graphs can change over time. A dynamic graph can be represented by a series of static graphs with different timestamps.

        静态/动态图:静态图的结构和特性始终不变,而动态图的结构与特性可以随时间变化。动态图可以由一系列具有不同时间戳的静态图表示

B. Graph Neural Networks

        GNNs have been dominated to be a promising algorithmic model for learning knowledge from graph data [63]–[68]. It takes the graph data as input and learns a representation vector for each vertex in the graph. The learned representation can be used for down-stream tasks such as vertex prediction [15], link prediction [16], and graph prediction [17].

        GNN一直被认为是从图形数据中学习知识的一种有前途的算法模型[63]–[68]。它将图形数据作为输入,并学习图形中每个顶点的向量表示。学习的表示可用于下游任务,如顶点预测[15]、链接预测[16]和图预测[17]。

        As illustrated in Fig. 3, a GNN model consists of one or multiple layers consisting of neighbor aggregation and neural network operations, referred to as the Aggregation step and Combination step, respectively. In the Aggregation step, the Aggregate function Aggregate( ) is used to aggregate the feature vectors of in-coming neighboring vertices from the previous GNN layer for each target vertex. For example, in Fig. 3, vertex 4 would gather the feature vectors of itself and its incoming neighboring vertices (i.e., vertex 2, 5, 8) using the Aggregate function. In the Combination step, the Combine function Combine( ) transforms the aggregated feature vector of each vertex using neural network operations. To sum up, the aforementioned computation on a graph G(V, E) can be formally expressed by

        如图3所示,GNN模型由一个或多个层组成,包括邻居聚合神经网络操作,分别称为聚合步骤组合步骤。在聚合步骤中,聚合函数Aggregate()用于为每个目标顶点聚合来自前一个GNN层的传入相邻顶点的特征向量。例如,在图3中,顶点4将使用聚合函数收集其自身及其传入相邻顶点(即顶点2、5、8)的特征向量。在组合步骤中,组合函数Combine()使用神经网络操作变换每个顶点的聚合特征向量。综上所述,图G(V,E)上的上述计算可以形式化地表示为

        a_v^l = Aggregate({ h_u^{l-1} : u ∈ N(v) ∪ {v} })    (1)
        h_v^l = Combine(a_v^l)    (2)

        where h_v^l denotes the feature vector of vertex v at the l-th layer, and N(v) represents the neighbors of vertex v. Specifically, the input feature of vertex v ∈ V is denoted as h_v^0.

        其中,h_v^l 表示顶点v在第l层的特征向量,N(v)表示顶点v的邻居。具体而言,顶点 v ∈ V 的输入特征表示为 h_v^0。
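
        为帮助理解上述Aggregate/Combine两个步骤,下面给出一个极简的示意性实现(基于PyTorch,用求和作为Aggregate函数、线性层+ReLU作为Combine函数;其中的张量形状和小图均为假设的示例,并非原文代码):

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """示意性的单层GNN:Aggregate(累加自身与入边邻居特征)+ Combine(线性变换+ReLU)。"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # Combine步骤中的神经网络操作

    def forward(self, h, edges):
        # h: [N, in_dim] 上一层的顶点特征;edges: [(src, dst), ...] 有向边列表
        agg = h.clone()                      # 聚合结果初始化为顶点自身的特征
        for src, dst in edges:               # Aggregate步骤:沿入边累加邻居特征
            agg[dst] = agg[dst] + h[src]
        return torch.relu(self.linear(agg))  # Combine步骤:神经网络变换

# 用法示例(假设的小图):顶点4聚合自身及来自顶点2、5、8的特征
h0 = torch.randn(9, 16)                      # 9个顶点,输入特征维度16
edges = [(2, 4), (5, 4), (8, 4)]
layer = SimpleGNNLayer(16, 32)
h1 = layer(h0, edges)                        # h1: [9, 32]
```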

C. Training Methods for Graph Neural Networks

        In this subsection, we introduce the training methods for GNNs, which are approached in two ways including full-batch training [69], [70] and mini-batch training [13], [71]–[74].

        在本小节中,我们介绍了GNN的训练方法,这两种方法包括全批训练[69],[70]和小批训练[13],[71]–[74]。

        A typical training procedure of neural networks, including GNNs, includes forward propagation and backward propagation. In forward propagation, the input data is passed through the layers of neural networks towards the output. Neural networks generate differences of the output of forward propagation by comparing it to the predefined labels. Then in backward propagation, these differences are propagated through the layers of neural networks in the opposite direction, generating gradients to update the model parameters.

        神经网络(包括GNN)的典型训练过程包括正向传播和反向传播。在正向传播中,输入数据通过神经网络的各层传递到输出端。神经网络将正向传播的输出与预定义的标签进行比较,得到两者之间的差异。然后在反向传播中,这些差异沿相反方向通过神经网络各层传播,生成用于更新模型参数的梯度。

        As illustrated in Fig. 4, training methods of GNNs can be classified into full-batch training [69], [70] and mini-batch training [13], [71]–[74], depending on whether the whole graph is involved in each round. Here, we define a round of full-batch training consisting of a model computation phase, including forward and backward propagation, and a parameter update phase. On the other hand, a round in mini-batch training additionally contains a sampling phase, which samples a small-sized workload required for the subsequent model computation and thus locates prior to the other two phases. Thus, an epoch, which is defined as an entire pass of the data, is equivalent to a round of full-batch training, while that in mini-batch training usually contains several rounds. Details of these two methods are introduced below.

        如图4所示,GNN的训练方法可分为全批量训练[69]、[70]和小批量训练[13]、[71]–[74],取决于每一轮(round)是否涉及整个图。这里,我们定义全批量训练的一轮由模型计算阶段(包括正向和反向传播)和参数更新阶段组成。另一方面,小批量训练的一轮还额外包含一个采样阶段,该阶段采样后续模型计算所需的小规模工作负载,因此位于其他两个阶段之前。因此,被定义为对数据进行一次完整遍历的epoch相当于全批量训练的一轮,而小批量训练中的一个epoch通常包含多轮。下面将介绍这两种方法的详细信息。

        1) Full-batch Training: Full-batch training utilizes the whole graph to update model parameters in each round.

        1)全批量训练:全批量训练在每一轮中利用整个图来更新模型参数。

        Given a training set Vt ⊂ V, the loss function of full-batch training is

        给定一个训练集Vt⊆V,全批训练的损失函数为

        L = (1/|V_t|) Σ_{v_i ∈ V_t} l(y_i, z_i)    (3)

        where l(·) is a loss function, yi is the known label of vertex vi, and zi is the output of the GNN model when inputting feature xi of vi. In each epoch, the GNN model needs to aggregate representations of all neighboring vertices for each vertex in Vt all at once. As a result, the model parameters are updated only once at each epoch.

        其中,l(·) 是损失函数,yi是顶点vi的已知标签,zi是GNN模型在输入vi的特征xi时的输出。在每个epoch中,GNN模型需要一次性为Vt中的每个顶点聚合其所有相邻顶点的表示。因此,模型参数在每个epoch仅更新一次。

        2) Mini-batch Training: Mini-batch training utilizes part of the vertices and edges in the graph to update model parameters in every forward propagation and backward propagation. It aims to reduce the number of vertices involved in the computation of one round to reduce the computing and memory resource requirements.

        2)小批量训练:小批量训练利用图形中的部分顶点和边来更新每个正向传播和反向传播中的模型参数。它旨在减少一轮计算中涉及的顶点数量,以减少计算和内存资源需求。

        Before each round of training, a mini-batch Vs is sampled from the training dataset Vt. By replacing the full training dataset Vt in equation (3) with the sampled mini-batch Vs, we obtain the loss function of mini-batch training:

        在每一轮训练之前,从训练数据集Vt中采样一个小批量Vs。通过将等式(3)中的完整训练数据集Vt替换为采样得到的小批量Vs,我们获得了小批量训练的损失函数:

        L = (1/|V_s|) Σ_{v_i ∈ V_s} l(y_i, z_i)    (4)

         It indicates that for mini-batch training, the model parameters are updated multiple times at each epoch, since numerous mini-batches are needed to have an entire pass of the training dataset, resulting in many rounds in an epoch. Stochastic Gradient Descent (SGD) [75], a variant of gradient descent which applies to mini-batch, is used to update the model parameters according to the loss L.

        这表明,对于小批量训练,模型参数在每个epoch中会被多次更新,因为需要许多个小批量才能完成对训练数据集的一次完整遍历,从而使一个epoch包含许多轮。随机梯度下降(SGD)[75]是一种适用于小批量的梯度下降变体,用于根据损失L更新模型参数。
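
        为了直观对比这两种训练方法,下面给出一个示意性的PyTorch训练循环片段(其中的模型和数据均为占位的假设对象,仅演示“全批量每个epoch更新一次参数、小批量每个epoch更新多次参数”的差别,并非原文代码):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                     # 占位模型,代表某个GNN
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # 随机梯度下降(SGD)

feats = torch.randn(100, 16)                 # 100个训练顶点的特征(占位数据)
labels = torch.randint(0, 2, (100,))

# 全批量训练:每个epoch对整个训练集做一次前向/反向,参数每个epoch只更新一次
for epoch in range(3):
    loss = loss_fn(model(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# 小批量训练:每个epoch包含多轮,每轮采样一个mini-batch并更新一次参数
for epoch in range(3):
    for idx in torch.randperm(100).split(20):  # 采样阶段:随机划分出5个mini-batch
        loss = loss_fn(model(feats[idx]), labels[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()                             # 每轮更新一次模型参数
```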

        Sampling: Mini-batch training requires a sampling phase to generate the mini-batches. The sampling phase first samples a set of vertices, called target vertices, from the training set according to a specific sampling strategy, and then it samples the neighboring vertices of these target vertices to generate a complete mini-batch. The sampling method can be generally categorized into three groups: Node-wise sampling, Layerwise sampling, and Subgraph-based sampling [56], [76].

        采样:小批量训练需要一个采样阶段来生成小批量。采样阶段首先根据特定的采样策略从训练集中采样一组顶点(称为目标顶点),然后对这些目标顶点的相邻顶点进行采样,以生成一个完整的小批量。采样方法通常可分为三类:逐节点采样、逐层采样和基于子图的采样[56],[76]。

小批量训练需要一个采样阶段来生成小批量。

        Node-wise sampling [13], [22], [77], [78] is directly applied to the neighbors of a vertex: the algorithm selects a subset of each vertex's neighbors. It is typical to specify a different sampling size for each layer. For example, in GraphSAGE [13], it samples at most 25 neighbors for each vertex in the first layer and at most 10 neighbors in the second layer.

        逐节点采样[13],[22],[77],[78]直接应用于顶点的邻居:算法选择每个顶点邻居的子集。通常为每个层指定不同的采样大小。例如,在GraphSAGE[13]中,它为第一层中的每个顶点采样最多25个邻居,在第二层中采样最多10个邻居。
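
        下面是逐节点采样的一个示意性实现(纯Python,邻接表用字典表示;fanout取(25, 10)只是沿用GraphSAGE的设置作说明,函数名等均为假设,并非某个框架的真实API):

```python
import random

def node_wise_sample(adj, targets, fanouts=(25, 10)):
    """逐节点采样:从目标顶点出发,逐层为每个顶点随机抽取最多fanout个邻居。
    adj: {顶点: [邻居列表]};targets: 本mini-batch的目标顶点;返回各层采到的边。"""
    layers, frontier = [], list(targets)
    for fanout in fanouts:                      # 每一层可以指定不同的采样数
        sampled_edges, next_frontier = [], []
        for v in frontier:
            neighbors = adj.get(v, [])
            chosen = random.sample(neighbors, min(fanout, len(neighbors)))
            sampled_edges.extend((u, v) for u in chosen)
            next_frontier.extend(chosen)
        layers.append(sampled_edges)
        frontier = next_frontier                # 下一层继续对新引入的顶点采样
    return layers

# 用法示例(假设的小图)
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(node_wise_sample(adj, targets=[0], fanouts=(2, 2)))
```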

        Layer-wise sampling [71], [72], [79] enhances Nodewise sampling. It selects multiple vertices in each layer and then proceeds recursively layer by layer.

        逐层采样[71]、[72]、[79]增强了节点采样。它在每个层中选择多个顶点,然后逐层递归进行。

        Subgraph-based sampling [73], [80]–[82] first partition the original graph into multiple subgraphs, and then samples the mini-batches from one or a certain number of them.

        基于子图的采样[73],[80]–[82]首先将原始图划分为多个子图,然后从其中的一个或一定数量的子图中对小批量进行采样。

TAXONOMY OF DISTRIBUTED GNN TRAINING

        This section introduces the taxonomy of distributed GNN training. As shown in Fig. 5, we firstly categorize it into distributed full-batch training and distributed mini-batch training, according to the training method introduced in Sec. II-C, i.e., whether the whole graph is involved in each round, and show the key differences between the two types. Each of the two types is classified further into two detailed types respectively by analyzing their workflows. This section introduces the firstlevel category, that is, distributed full-batch training and distributed mini-batch training, and makes a comparison between them. The second-level category of the two types will later be introduced in Sec. IV and Sec. V, respectively.

        本节介绍分布式GNN训练的分类。如图5所示,我们首先根据第II-C节中介绍的训练方法,即每一轮是否涉及整个图,将其分为分布式全批量训练和分布式小批量训练,并展示了这两种类型之间的关键差异。通过分析各自的工作流程,这两种类型又分别被进一步细分为两个子类。本节介绍第一级分类,即分布式全批量训练和分布式小批量训练,并对它们进行比较。这两种类型的第二级分类稍后将分别在第四节和第五节中介绍。

A. Distributed Full-batch Training

         Distributed full-batch training is the distributed implementation of GNN full-batch training, as illustrated in Fig. 4. Except for graph partition, a major difference is that multiple computing nodes need to synchronize gradients before updating model parameters, so that the models across the computing nodes remain unified. Thus, a round of distributed full-batch training includes two phases: model computation (forward propagation + backward propagation) and gradient synchronization. The model parameter update is included in the gradient synchronization phase.

        分布式全批训练是GNN全批训练的分布式实现,如图4所示。除了图分区之外,一个主要的区别是多个计算节点需要在更新模型参数之前同步梯度,以便计算节点之间的模型保持统一。因此,一轮分布式全批训练包括两个阶段:模型计算(正向传播+反向传播)和梯度同步。模型参数更新包括在梯度同步阶段中。

多个计算节点需要在更新模型参数之前同步梯度,以便计算节点之间的模型保持统一。

        Since each round involves the entire raw graph data, a considerable amount of computation and a large memory footprint are required in each round [37], [47], [50]. To deal with it, distributed full-batch training mainly adopts the workload partitioning method [44], [45]: split the graph to generate small workloads, and hand them over to different computing nodes.

        由于每一轮都涉及全部原始图数据,因此每一轮都需要大量的计算和很大的内存占用[37],[47],[50]。为了解决这一问题,分布式全批量训练主要采用工作负载划分方法[44],[45]:分割图以生成小的工作负载,并将其交给不同的计算节点处理。

        Such a workflow leads to a lot of irregular inter-node communications in each round, mainly for transferring the features of vertices along the graph structure. This is because the graph data is partitioned and consequently stored in a distributed manner, and the irregular connection pattern in a graph, such as the arbitrary number and location of a vertices' neighbors. Therefore, many uncertainties exist in the communication of distributed full-batch training, including the uncertainty of the communication content, target, time, and latency, leading to challenges in the optimization of distributed full-batch training.

        这样的工作流在每一轮中都会导致大量不规则的节点间通信,主要用于沿图结构传递顶点的特征。这是因为图数据被分割并以分布式方式存储,而且图中存在不规则的连接模式,例如一个顶点的邻居数量和位置都是任意的。因此,分布式全批量训练的通信中存在许多不确定性,包括通信内容、目标、时间和延迟的不确定性,给分布式全批量训练的优化带来了挑战。

        As shown in Fig. 5, we further classify distributed fullbatch training more specifically into two categories according to whether the workload is preset in the preprocessing phase, namely dispatch-workload-based execution and presetworkload-based execution, as shown in the second column of Table III. Their detailed introduction and analysis are presented in Sec. IV-A and Sec. IV-B.

        如图5所示,根据预处理阶段是否预设了工作负载,我们进一步将分布式全批量训练更具体地分为两类,即基于调度工作负载的执行和基于预设工作负载的执行,如表III第二列所示。第IV-A节和第IV-B节对其进行了详细介绍和分析。

B. Distributed Mini-batch Training

        Similar to distributed full-batch training, distributed minibatch training is the distributed implementation of GNN minibatch training as in Fig. 4. It also needs to synchronize gradients prior to model parameter update, so a round of distributed mini-batch training includes three phases: sampling, model computation, and gradient synchronization. The model parameter update is included in the gradient synchronization phase.

        与分布式全批量训练类似,分布式小批量训练是GNN小批量训练的分布式实施,如图4所示。它还需要在模型参数更新之前同步梯度,因此一轮分布式小批量训练包括三个阶段:采样、模型计算和梯度同步。模型参数更新包括在梯度同步阶段中。

        Distributed mini-batch training parallelizes the training process by processing several mini-batches simultaneously, one for each computing node. The mini-batches can be sampled either by the computing node itself or by other devices, such as another node specifically for sampling. Each computing node performs forward propagation and backward propagation on its own mini-batch. Then, the nodes synchronize and accumulate the gradients, and update the model parameters accordingly. Such a process can be formulated by

        分布式小批量训练通过同时处理多个小批量(每个计算节点一个)来并行化训练过程。可以由计算节点本身或其他设备(例如专门用于采样的另一个节点)对小批量进行采样。每个计算节点在其自己的小批量上执行前向传播和后向传播。然后,节点同步并累积梯度,并相应地更新模型参数。该过程可通过以下方式制定:

        where Wi is the weight parameters of model in the ith round of computation, ∇gi,j is the gradients generated in the backward propagation of the computing node j in the ith round of computation, and the n is the number of the computing nodes.

        其中Wi是第i轮计算中模型的权重参数,∇gi,j是第i次计算中计算节点j反向传播产生的梯度,n是计算节点的数量。
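
        上述梯度同步与累加的过程,可以用如下基于 torch.distributed 的示意性代码表达(假设进程组已在别处初始化;这只是一个概念性草图,并非原文实现):每个计算节点在本地反向传播得到梯度后,所有节点对梯度做all-reduce累加(常见实现还会再取平均),再各自更新模型副本,从而保持各节点模型参数一致。

```python
import torch
import torch.distributed as dist

def synchronize_gradients(model, average=True):
    """梯度同步:对所有计算节点的梯度做all-reduce累加;average=True时再取平均。
    假设 dist.init_process_group(...) 已在别处完成初始化。"""
    world_size = dist.get_world_size()   # 计算节点(进程)数 n
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # 累加各节点的梯度
            if average:
                param.grad /= world_size                       # 常见做法:再取平均

# 每一轮:本地反向传播 -> 梯度同步 -> 参数更新(各节点模型保持一致)
# loss.backward(); synchronize_gradients(model); optimizer.step()
```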

        As shown in Fig. 5, we further classify it more specifically into two categories according to whether the sampling and model computation are decoupled, namely individual-samplebased execution and joint-sample-based execution, as shown in the second column of Table III. Their detailed introduction and analysis are presented in Sec. V-A and Sec. V-B.

        如图5所示,根据采样和模型计算是否解耦,我们进一步将其更具体地分为两类,即基于单独样本的执行基于联合样本的执行,如表III第二列所示。第V-A节和第V-B节对其进行了详细介绍和分析。

C. Comparison Between Distributed Full-batch Training and Distributed Mini-batch Training

        This subsection compares distributed full-batch training with distributed mini-batch training of GNNs. The major differences are also summarized in Table II.

        本小节比较了GNN的分布式全批量训练和分布式小批量训练。表二还总结了主要差异。

         The workflow of distributed full-batch training is summarized as collaborative computation with workload partition. Since the computation in each round involves the entire graph, the computing nodes need to cache it locally, leading to a high memory capacity requirement [44]. Also, the communication volume of distributed full-batch training is large [40], [47]. In every round, the Aggregate function needs to collect the features of neighbors for each vertex, causing a large quantity of inter-node communication requests since the graph is partitioned and stored on different nodes. Considering that the communication is based on the irregular graph structure, the communication irregularity of distributed full-batch training is high [46]. Another characteristic of communication is high uncertainty. The time of generating the communication request is indeterminate, since each computing node sends communication requests according to the currently involved vertices in its own computing process. As a result, the main challenges of distributed full batch training are workload imbalance and massive transmissions [45]–[47].

        分布式全批量训练的工作流程概括为具有工作负载划分的协同计算由于每一轮的计算都涉及到整个图形,因此计算节点需要在本地缓存它,从而导致高内存容量需求[44]。此外,分布式全批量训练的通信量很大[40],[47]。在每一轮中,Aggregate函数都需要收集每个顶点的邻居的特征,这会导致大量的节点间通信请求,因为图被划分并存储在不同的节点上。考虑到通信基于不规则图形结构,分布式全批量训练的通信不规则性很高[46]。通信的另一个特点是高度不确定性。生成通信请求的时间是不确定的,因为每个计算节点根据其自身计算过程中当前涉及的顶点发送通信请求。因此,分布式全批量训练的主要挑战是工作负载不平衡和大量传输[45]–[47]。

分布式全批量训练的主要挑战是工作负载不平衡大量传输

        In contrast, the workflow of distributed mini-batch training is summarized as independent computation with periodic synchronization. The major transmission content is the minibatches, sent from the sampling node (or component) to the computing node (or component) responsible for the current mini-batch [51], [52], [83]. As a result, these transmissions have low irregularity and low uncertainty, as the direction and content of transmission are deterministic. Since the computation of each round only involves the mini-batches, it triggers less communication volume and requires less memory capacity [51]. However, the extra sampling phase may cause some new challenges. Since the computation of the sampling phase is irregular and requires access to the whole graph for neighbor information of a given vertex, it is likely to encounter the problem of insufficient sampling performance, causing the subsequent computing nodes (or components) to stall due to lack of input, resulting in a performance penalty [56], [83].

        相比之下,分布式小批量训练的工作流被概括为具有周期性同步的独立计算。主要传输内容是小批量数据,由采样节点(或组件)发送到负责当前小批量的计算节点(或组件)[51],[52],[83]。因此,这些传输具有低不规则性和低不确定性,因为传输的方向和内容都是确定的。由于每一轮的计算只涉及小批量,因此它触发的通信量更少,需要的内存容量也更少[51]。然而,额外的采样阶段可能会带来一些新的挑战。由于采样阶段的计算是不规则的,并且需要访问整个图以获取给定顶点的邻居信息,因此很可能会遇到采样性能不足的问题,导致后续的计算节点(或组件)由于缺少输入而停顿,从而造成性能损失[56],[83]。

D. Other Information of Taxonomy

        Table III provides a summary of the current studies on distributed GNN training using our proposed taxonomy. Except for the aforementioned classifications, we also add some supplemental information in the table to provide a comprehensive review of them.

        表III总结了使用我们提出的分类法进行分布式GNN训练的当前研究。除上述分类外,我们还在表中添加了一些补充信息,以对其进行全面审查。

        Software frameworks. The software frameworks used by the various studies are shown in the third column of Table III. PyTorch Geometric (PyG) [84] and Deep Graph Library (DGL) are the most popular among them. In addition, there are many newly proposed software frameworks aiming at distributed training of GNNs, and many of them are the optimization version of PyG [84] or DGL [85]. A detailed introduction to the software frameworks of distributed GNN training is presented in Sec. VI.

        软件框架。各种研究使用的软件框架如表III第三列所示。PyTorch Geometric(PyG)[84]和Deep Graph Library(DGL)是其中最受欢迎的。此外,有许多新提出的软件框架旨在对GNN进行分布式训练,其中许多是PyG[84]或DGL[85]的优化版本。第六节详细介绍了分布式GNN训练的软件框架。

        Hardware platforms. Multi-CPU platform and multi-GPU platform are the most common hardware platforms of distributed GNN training, as shown in the fourth column of Table III. Multi-CPU platform usually refers to a network with multiple servers, which uses CPUs as the only computing component. On the contrary, in multi-GPU platforms, GPUs are responsible for the major computing work, while CPU(s) handle some computationally complex tasks, such as workload partition and sampling. A detailed introduction to the hardware platforms is presented in Sec. VII.

        硬件平台。多CPU平台多GPU平台是分布式GNN训练最常见的硬件平台,如表III第四列所示。多CPU平台通常指具有多个服务器的网络,其使用CPU作为唯一的计算组件。相反,在多GPU平台中,GPU负责主要的计算工作,而CPU处理一些计算复杂的任务,例如工作负载分区和采样。第七节详细介绍了硬件平台。

        Year. The contribution of distributed GNN training began to emerge in 2019 and is now showing a rapid growth trend. This is because more attention is paid to it due to the high demand from industry and academia to shorten the training time of GNN model.

        年份。分布式GNN训练方面的工作在2019年开始出现,目前呈现快速增长趋势。这是因为工业界和学术界对缩短GNN模型训练时间有很高的需求,使其受到越来越多的关注。

        Code available. The last column of Table III simply records the open source status of the corresponding study on distributed GNN training for the convenience of readers.

        代码可用。表III的最后一列简单地记录了分布式GNN训练的相应研究的开源状态,以方便读者。

IV. DISTRIBUTED FULL-BATCH TRAINING

        This section describes GNN distributed full-batch training in detail. Our taxonomy classifies it into two categories according to whether the workload is preset in the preprocessing phase, namely dispatch-workload-based execution and presetworkload-based execution, as shown in Fig. 6 (a).

        本节详细描述GNN的分布式全批量训练。我们的分类法根据预处理阶段是否预设了工作负载将其分为两类,即基于调度工作负载的执行和基于预设工作负载的执行,如图6(a)所示。

 A. Dispatch-workload-based Execution

        The dispatch-workload-based execution of distributed fullbatch training is illustrated in Fig. 6 (b). Its workflow, computational pattern, communication pattern, and optimization techniques are introduced in detail as follows.

        分布式全批量训练的基于调度工作负载的执行如图6(b)所示。其工作流程、计算模式、通信模式和优化技术详细介绍如下。

        1) Workflow: In the dispatch-workload-based execution, a leader and multiple workers are used to perform training. The leader stores the model parameters and the graph data, and is also responsible for scheduling: it splits the computing workloads into chunks, distributes them to the workers, and collects the intermediate results sent from the workers. It also processes these results and advances the computation. Note that, the chunk we use here is as a unit of workload.

        1) 工作流:在基于调度工作负载的执行中,由一个leader和多个worker共同进行训练。leader存储模型参数和图数据,并负责调度:它将计算工作负载分割成chunk,分发给各个worker,并收集worker发回的中间结果。它还处理这些结果并推进计算。注意,这里的chunk是工作负载的单位。

        2) Computational Pattern: The computational patterns of forward propagation and backward propagation are similar in dispatch-workload-based execution: the latter can be seen as the reversed version of the former. As a result, we only introduce forward propagation's computational pattern here for simplicity. The patterns of the two functions in forward propagation (Aggregate and Combine) differ a lot and will be introduced below respectively.

        2) 计算模式:在基于调度工作负载的执行中,前向传播和后向传播的计算模式是相似的:后者可以被视为前者的逆过程。因此,为了简单起见,这里只介绍前向传播的计算模式。前向传播中两个函数(Aggregate和Combine)的计算模式差别很大,下面将分别介绍。

        Aggregate function. The computational pattern of the Aggregate function is dynamic and irregular, making workload partition for this step a challenge. In the Aggregation step, each vertex needs to aggregate the features of its own neighbors. As a result, the computation of the Aggregation step relies heavily on the graph structure, which is irregular or even changeable. Thus, the number and memory location of neighbors vary significantly among vertices, resulting in the dynamic and irregular computational pattern [86], causing the poor workload predictability and aggravating the difficulty of workload partition.

        聚合函数。聚合函数的计算模式是动态的不规则的,这使得这一步骤的工作负载划分成为一个挑战。在聚合步骤中,每个顶点都需要聚合其自身邻居的特征。因此,聚合步骤的计算严重依赖于不规则甚至可变的图形结构。因此,相邻节点的数量和内存位置在顶点之间显著不同,导致动态和不规则的计算模式[86],导致工作负载可预测性差,并加剧了工作负载划分的难度。

        Combine function. The computational pattern of the Combine function is static and regular, thus the workload partition for it is simple. The computation of the Combination step is to perform neural network operations on each vertex. Since the structure of neural networks is regular and these operations share the same weight parameters, the Combination step enjoys a regular computational pattern. Consequently, a simple partitioning method is sufficient to maintain workload balance, so it is relatively not a major consideration in dispatch-workload-based execution of GNN distributed fullbatch training.

        组合函数。Combine函数的计算模式是静态的规则的,因此它的工作负载划分很简单。组合步骤的计算是对每个顶点执行神经网络操作。由于神经网络的结构是规则的,并且这些操作共享相同的权重参数,因此组合步骤具有规则的计算模式。因此,简单的分区方法足以保持工作负载平衡,因此在GNN分布式全批训练的基于调度工作负载的执行中,它相对不是主要考虑因素

        3) Communication Pattern: The majority of communication is the transmission of input data and output results between the leader and the workers [44]. Since the leader is responsible for workload distribution and intermediate result collection, it needs to communicate with multiple workers. Such a communication structure results in a one-to-many communication pattern, resulting in a possible bottleneck in the leader's communication path [44]. However, these communications are relatively regular since the distribution of tasks is controlled by the leader. Therefore, the communication path congestion can be avoided to some extent by optimized scheduling techniques.

        3)通信模式:大部分通信是leader和worker之间输入数据和输出结果的传输[44]。由于leader负责工作负载分配和中间结果收集,因此需要与多个worker进行通信。这种通信结构导致一对多的通信模式,导致leader的通信路径可能出现瓶颈[44]。然而,这些通信相对常规,因为任务的分布由leader控制。因此,通过优化调度技术可以在一定程度上避免通信路径拥塞

        4) Optimization Techniques: Next, we introduce the optimization techniques used to partition workload, balance workload, reduce transmission, and exploit parallelism for dispatch-workload-based execution. We classify them into four categories, including 1 Vertex-centric Workload Partition, 2 Balanced Workload Generation, 3 Transmission Planning, and 4 Feature-dimension Workload Partition. A summary for these categories is shown in Table IV. Note that the transmission reduction in this paper refers to a reduction in the amount of transmitted data or in transmission time.

        4)优化技术:接下来,我们介绍用于分区工作负载、平衡工作负载、减少传输以及利用并行性进行基于调度工作负载的执行的优化技术。我们将它们分为四类,包括1、以顶点为中心的工作负载分区、2、平衡工作负载生成、3、传输规划4、特征维度工作负载分区。这些类别的摘要显示在表 IV 中。请注意,本文中的传输减少是指传输数据量或传输时间的减少

        1 Vertex-centric Workload Partition. Vertex-centric workload partition refers to the technique of generating workload chunks by partitioning the graph or matrix from the perspective of vertices. Specifically, the graph is partitioned into a list of subgraphs according to the source vertex u and the destination vertex v of edge (u, v). Then the subgraphs are taken as the workload chunk, and the leader distributes them to each computing node for computation. This is a very common partitioning method for processing large-scale graphs in the traditional graph analytics [87]–[89].

        1、以顶点为中心的工作负载分区。以顶点为中心的工作负载分区是指通过从顶点的角度对图或矩阵进行分区来生成工作负载块的技术。具体来说,根据边 (u, v) 的源顶点 u 和目标顶点 v,将图划分为子图列表。然后将子图作为workload chunk,由leader分发给各个计算节点进行计算。这是传统图分析中处理大规模图的一种非常常见的分区方法[87]-[89]。

        Fig. 7 illustrates 2D graph partition, a typical example of graph partition. The vertices are firstly partitioned into P disjoint blocks. Then, we tile the edges into P × P chunks according to their source and destination vertices: in the yth chunk of the xth row, the source vertices of the edges all belong to the xth source vertex block, and the destination vertices all belong to the yth destination vertex block. This partitioning method works well for the Aggregation step, which needs to transfer the information of the source vertex to the destination vertex along the edge.

        图7展示了2D图划分,这是图划分的一个典型例子。顶点首先被划分为P个不相交的块。然后,我们根据边的源顶点和目标顶点将边平铺成P × P个chunk:在第x行的第y个chunk中,边的源顶点都属于第x个源顶点块,目标顶点都属于第y个目标顶点块。这种划分方法非常适合需要将源顶点的信息沿边传递到目标顶点的聚合步骤。
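
        下面用一小段示意性代码说明这种2D图划分的思路(纯Python;按顶点编号范围把顶点分成P个不相交块,再按边的源/目标顶点所属块把边放入P × P个chunk;函数与数据均为假设的示例,并非NeuGraph的实现):

```python
def partition_2d(num_vertices, edges, P):
    """2D图划分:顶点分成P个不相交块,边按(源块, 目标块)平铺成P x P个chunk。"""
    block_size = (num_vertices + P - 1) // P
    block_of = lambda v: v // block_size          # 顶点v所属的块编号
    chunks = [[[] for _ in range(P)] for _ in range(P)]
    for u, v in edges:                            # 边(u, v):u为源顶点,v为目标顶点
        chunks[block_of(u)][block_of(v)].append((u, v))
    return chunks

# 用法示例:8个顶点、P=2;chunks[x][y]中边的源顶点属于第x块、目标顶点属于第y块
edges = [(0, 5), (1, 2), (6, 7), (4, 1)]
chunks = partition_2d(8, edges, P=2)
print(chunks[0][1])   # [(0, 5)]:源顶点0在块0,目标顶点5在块1
```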

         NeuGraph [44] adopts the 2D graph partitioning method. It chooses P as the minimum integer satisfying the requirement to fit each chunk in the device memory of GPUs. Roc [45], on the other hand, uses the graph partition strategy proposed in [89], which can also be mentioned as 1D graph partition. Suppose there are now n computing nodes, and the number of vertices in the graph is T. Then select n-1 numbers, ranging from 1 to T, to split the vertices into n parts, that is, n subgraphs. In this way, each subgraph holds consecutively numbered vertices, which are stored in adjacent locations, and their in-edges, thereby improving the data access efficiency of computing nodes.

        NeuGraph [44] 采用2D图划分方法。它将P选为能使每个chunk放入GPU设备内存的最小整数。另一方面,Roc [45] 使用 [89] 中提出的图划分策略,也可以称为1D图划分。假设现在有n个计算节点,图中的顶点数为T。然后选择n-1个数(范围从1到T),将顶点拆分为n个部分,即n个子图。这样,每个子图都持有编号连续、存储位置相邻的顶点及其入边,从而提高了计算节点的数据访问效率。

         2 Balanced Workload Generation. Workload balance is an extremely important optimization goal for dispatchworkload-based execution. Since the workloads are split and distributed to multiple computing nodes, the prerequisite of the continuation of computing is that all the computing nodes have already returned their intermediate results. If the workload is not evenly partitioned, the consequent lag of waiting will stall the training process. Therefore, it is necessary to carefully adjust the workload partition, so as to make the workload of each computing node as balanced as possible.

        2、平衡工作负载生成。工作负载平衡是基于调度工作负载的执行的一个极其重要的优化目标。由于工作负载被拆分分布到多个计算节点,继续计算的前提是所有计算节点都已经返回了中间结果。如果工作负载分配不均,则随之而来的等待滞后会使训练过程停滞。因此需要仔细调整工作负载划分,使各计算节点的工作负载尽可能均衡。

        In response to that, Roc [45] proposes a linear regression cost model to produce balanced workloads in each round. The cost model is used to predict the computation time of a GNN layer on an arbitrary input, which could be the whole or any subset of an input graph.

        针对这一问题,Roc [45] 提出了一个线性回归成本模型,用于在每一轮中产生均衡的工作负载。该成本模型用于预测GNN层在任意输入上的计算时间,该输入可以是整个输入图,也可以是它的任意子集。
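
        作为对这一思路的示意(并非Roc的真实实现,其中的测量数据为编造的占位示例):先用线性回归根据工作块的顶点数、边数拟合其计算时间,再按预测时间做贪心分配,使各计算节点的预测负载尽量均衡。

```python
import numpy as np

# 假设的历史测量数据:每个工作块的(顶点数, 边数)及其实测计算时间(毫秒)
X = np.array([[1000, 5000], [2000, 8000], [500, 1500], [3000, 20000]], dtype=float)
t = np.array([12.0, 21.0, 5.0, 45.0])

# 线性回归成本模型:time ≈ w0*顶点数 + w1*边数 + b
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, t, rcond=None)

def predict_cost(num_vertices, num_edges):
    return w[0] * num_vertices + w[1] * num_edges + w[2]

def balance(chunks, n_workers):
    """贪心分配:每次把预测代价最大的未分配块交给当前预测负载最小的计算节点。"""
    loads = [0.0] * n_workers
    assign = [[] for _ in range(n_workers)]
    for c in sorted(chunks, key=lambda c: -predict_cost(*c)):
        k = loads.index(min(loads))
        assign[k].append(c)
        loads[k] += predict_cost(*c)
    return assign

print(balance([(1200, 6000), (800, 2500), (2500, 15000), (400, 900)], n_workers=2))
```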

        3 Transmission Planning. Planning data transmission is helpful to make full use of reusable data across computing components and nodes, thereby reducing data transmission. In dispatch-workload-based execution, the major transmission overhead is caused by the requirement to transmit the input data and intermediate results. Recent work focuses on harvesting two following optimization opportunities to reduce this overhead.

        3 传输规划。规划数据传输有助于充分利用跨计算组件和节点的可重用数据,从而减少数据传输。在基于调度工作负载的执行中,主要的传输开销来自传输输入数据和中间结果的需求。最近的工作重点是利用以下两个优化机会来减少这种开销。

        Avoid repeated transmission of overlapped data. The input data required by different computing tasks may overlap. As a result, caching these overlapped parts on computing nodes or components is a reasonable way to reduce transmission. Roc [45] formulates GPU memory management as a cost minimization problem. It uses a recursive dynamic programming algorithm to find the global optimal solution to decide which part of the data should be cached in GPU memory for reuse, according to the input graph, the GNN model, and device information. This minimizes data transmission between the CPU and GPUs.

        避免重复传输重叠数据。不同计算任务所需的输入数据可能会重叠。因此,将这些重叠部分缓存在计算节点或组件上是减少传输的合理方法。Roc [45] 将GPU内存管理表述为一个成本最小化问题。它使用递归动态规划算法寻找全局最优解,根据输入图、GNN模型和设备信息来决定哪部分数据应该缓存在GPU内存中以供重用。这最大限度地减少了CPU和GPU之间的数据传输。

        Rationally select the source of the transmission. The overhead of data transmission can be reduced by allowing computing components or nodes to send data, so every component or node can rationally select the source of the transmission. As a result, the design can reduce the overhead by decreasing the transmission distance. For instance, NeuGraph [44] employs a chain-based streaming scheduling scheme. The idea is to have one GPU (which already holds the data chunk) forward the data chunk to its neighbor GPU under the same PCIe switch, which can eliminate the bandwidth contention on the upperlevel shared inter-connection link.

        合理选择传输源。通过允许计算组件或节点发送数据可以减少数据传输的开销,因此每个组件或节点都可以合理地选择传输源。因此,该设计可以通过减少传输距离来减少开销。例如,NeuGraph [44] 采用基于链的流调度方案。想法是让一个 GPU(已经持有数据块)将数据块转发到同一 PCIe 交换机下的相邻 GPU,这样可以消除上层共享互连链路上的带宽争用

        4 Feature-dimension Workload Partition. Featuredimension partition refers to the finer partition of the workload from the dimension of the vertex feature, to make full use of the parallel computing hardware in the computing node. In traditional graph analytics, the majority of the input is the structure data of graphs, such as edges. However, in GNNs, vertex features make up the majority of the dataset, while graph structure takes up only a tiny fraction of it. This is because the feature of each vertex in traditional graph analytics is usually just a scalar, while it is a long-length vector or even a large tensor in GNNs. For example, the feature dimension of a vertex can be 512, 1024, or even more [44], [83]. It is thus feasible to split the workload more finely in the feature dimension, thereby further improving the degree of computational parallelism.

        4、特征维度工作负载分区。特征维度划分是指从顶点特征的维度对工作负载进行更精细的划分,以充分利用计算节点中的并行计算硬件。在传统的图分析中,大部分输入是图的结构数据,例如边。然而,在 GNN 中,顶点特征构成了数据集的大部分,而图结构只占其中的一小部分。这是因为传统图分析中每个顶点的特征通常只是一个标量,而在 GNN 中它是一个长向量甚至是一个大张量。例如,一个顶点的特征维度可以是512、1024,甚至更多 [44],[83]。因此可以在特征维度上更精细地拆分工作负载,从而进一步提高计算并行度

        To harvest the opportunity, NeuGraph [44] explores featurelevel parallelism by splitting the feature of each vertex evenly into multiple parts, which partitions the workload in a more fine-grained way. It then assigns it to different GPU threads for corresponding computation. This method takes full advantage of the massive parallel computing resources of GPUs.

        为了抓住机会,NeuGraph [44] 通过将每个顶点的特征平均分成多个部分来探索特征级并行性,从而以更细粒度的方式划分工作负载。然后将其分配给不同的 GPU 线程进行相应的计算。该方法充分利用了GPU海量的并行计算资源。
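
        特征维度划分的思路可以用下面的示意性代码表达(基于 torch.chunk,把每个顶点的特征向量沿特征维平均切成若干份、分别独立聚合后再拼接;张量形状为假设的示例,并非NeuGraph的真实实现):

```python
import torch

features = torch.randn(1024, 512)               # 1024个顶点,每个顶点512维特征
parts = torch.chunk(features, chunks=4, dim=1)  # 沿特征维切成4份,每份[1024, 128]

# 各部分可独立完成邻居聚合(此处以求和为例),互不依赖,最后再拼接回完整结果
neighbor_ids = torch.randint(0, 1024, (1024, 8))   # 假设每个顶点采样8个邻居(占位)
aggregated = [p[neighbor_ids].sum(dim=1) for p in parts]
result = torch.cat(aggregated, dim=1)           # [1024, 512],与不划分时的聚合结果一致
```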

B. Preset-workload-based Execution

        The preset-workload-based execution of distributed fullbatch training is illustrated in Fig. 6 (c). Its workflow, computational pattern, communication pattern, and optimization techniques are introduced in detail as follows.

        分布式全批训练的基于预设工作负载的执行如图6(c)所示。其工作流程、计算模式、通信模式和优化技术详细介绍如下。

        1) Workflow: Preset-workload-based execution involves multiple collaborative workers to perform training. The graph is firstly split into subgraphs through the partition operation in the preprocessing phase. Then, each worker holds one subgraph and a replica of the model parameters. During training, each worker is responsible to complete the computing tasks of all the vertices in its subgraph. As a result, a worker needs to query the information from other workers to gather the information of neighboring vertices. However, the Combine function can be performed directly locally, as the model parameters are replicated, but this also means that gradient synchronization is required for each round to ensure the consistency of model parameters across the nodes.

        1)工作流程:基于预设工作负载的执行涉及多个协作的worker来执行训练。在预处理阶段,首先通过分区操作将图分割成多个子图。然后,每个worker持有一个子图和模型参数的副本。在训练过程中,每个worker负责完成其子图中所有顶点的计算任务。因此,一个worker需要向其他worker查询信息,以收集相邻顶点的信息。不过,由于模型参数是复制的,Combine函数可以直接在本地执行,但这也意味着每一轮都需要进行梯度同步,以保证模型参数在各节点间的一致性。

        2) Computational Pattern: The computational pattern of preset-workload-based execution differs significantly in different steps. In the Aggregation step, a node needs to query the neighbor information of vertices from other nodes, so only when each computing node cooperates efficiently can the data be supplied to the target computing node in a timely manner. However, in the Combination step, each computing node conducts the operation on the vertices in its own subgraph independently due to the local replica of model parameters. As a result, the Aggregation step is more prone to inefficiencies such as computational stagnation, and it is the key optimization point for preset-workload-based execution.

        2)计算模式:基于预设工作负载的执行的计算模式在不同步骤中显着不同。在Aggregation步骤中,一个节点需要从其他节点查询顶点的邻居信息,因此只有每个计算节点高效协作,才能及时将数据提供给目标计算节点。但是,在Combination步骤中,由于模型参数的本地副本,每个计算节点都独立地对自己子图中的顶点进行操作。因此,聚合步骤更容易出现计算停滞等低效问题,是基于预设工作负载执行的关键优化点

        In addition, the preset workload has the benefit that the whole graph can be loaded at a time, as it is partitioned into small subgraphs which fit in the memory of computing nodes in the preprocessing phase. This means that it is more scalable than dispatch-workload-based execution when either adding more computing resources or increasing the size of the dataset. This avoids frequent accesses to graph data from lowspeed storage such as hard disks, thereby ensuring the timely provision of data for high-speed computing.

        此外,预设工作负载的好处是整个图可以一次性加载,因为它在预处理阶段被划分为能放入各计算节点内存的小子图。这意味着无论是增加计算资源还是增大数据集规模,它都比基于调度工作负载的执行更具可扩展性。这避免了从硬盘等低速存储中频繁访问图数据,从而保证及时为高速计算提供数据。

        3) Communication Pattern: Communication happens mostly in the Aggregation step and gradient synchronization. Due to the distributed storage of graph data, it encounters a lot of irregular transmissions during the Aggregation step when collecting the features of its neighboring vertices that are stored in other nodes due to graph partition [46]. Since the features of vertices are vectors or even tensors, the amount of data transmitted in the Aggregation step is large. Also, the communication is irregular due to the irregularity graph structure [86], [88], [90], which brings difficulty in optimizing connection. In contrast, the communication overhead of gradient synchronization is minuscule due to the small size of model parameters and the regular communication pattern. As a result, the communication between nodes in the Aggregation step is the main concern of preset-workload-based execution in distributed full-batch training.

        3)通信模式:通信主要发生在聚合步骤和梯度同步中。由于图数据以分布式方式存储,在聚合步骤中收集因图分区而存放在其他节点上的邻居顶点特征时,会遇到大量不规则传输[46]。由于顶点的特征是向量甚至张量,聚合步骤中传输的数据量很大。此外,由于图结构的不规则性[86]、[88]、[90],通信也是不规则的,这给优化带来了困难。相比之下,由于模型参数规模小且通信模式规则,梯度同步的通信开销微乎其微。因此,聚合步骤中节点之间的通信是分布式全批量训练中基于预设工作负载的执行的主要关注点。

        4) Optimization Techniques: Here we introduce the optimization techniques used to balance workload, reduce transmission, and reduce memory pressure for preset-workloadbased execution in detail. We classify them into four categories: 1 Graph Pre-partition, 2 Transmission Optimization,3 Delayed Aggregation, and 4 Activation Rematerialization. A summary for these categories is shown in Table V.

        4)优化技术:这里我们详细介绍基于预设工作负载的执行中用于平衡工作负载、减少传输和减少内存压力的优化技术。我们将它们分为四类:1 图预分区、2 传输优化、3 延迟聚合和4 激活再生成(Activation Rematerialization)。这些类别的摘要显示在表V中。

        1 Graph Pre-partition. Graph pre-partition refers to partitioning the whole graph into several subgraphs according to the number of computing nodes, mainly to balance workload and reduce transmission [46]. This operation is conducted in the preprocessing phase. The two key principles in designing the partitioning algorithm are listed as follows.

        1、图预分区。图预分区是指根据计算节点的数量将整个图划分为若干子图,主要是为了平衡工作负载和减少传输[46]。该操作在预处理阶段进行。设计分区算法的两个关键原则如下。

        First, in order to pursue workload balance, the subgraphs need to be similar in size. In preset-workload-based execution, each worker performs the computation of vertices within its own subgraph. Therefore, the size of the subgraph determines the workload of the worker. The main reference parameters are the number of vertices and edges in the subgraph. Since preset-workload-based execution has a gradient synchronization barrier, workload balance is very important for computing nodes to complete a round of computation at a similar time. Otherwise, some computing nodes will be idle, causing performance loss.

        首先,为了追求工作负载平衡,子图的大小需要相似。在基于预设工作负载的执行中,每个worker在其自己的子图中执行顶点计算。因此,子图的大小决定了worker的工作量。主要参考参数是子图中的顶点数和边数。由于基于预设工作负载的执行具有梯度同步屏障,因此工作负载平衡对于计算节点在相似的时间完成一轮计算非常重要。否则,一些计算节点会空闲,造成性能损失。

        Second, minimizing the number of edge-cuts in the graph pre-partition can reduce communication overhead. It is inevitable to cut edges in the graph pre-partition, meaning that the source and destination vertices of an edge might be stored in different computing nodes. When the information of neighboring vertex is required during the computation of the Aggregation step, communication between workers is introduced. Therefore, reducing the number of edges cut can reduce communication overhead in the Aggregation step.

        其次,最小化图预分区中的边切割数量可以减少通信开销。在图的预分区中不可避免地要切割边,这意味着边的源顶点和目标顶点可能存储在不同的计算节点中。当在聚合步骤的计算过程中需要相邻顶点的信息时,引入了worker之间的通信。因此,减少边切割的数量可以减少聚合步骤中的通信开销。 

        DGCL [46] uses METIS library [91] to partition the graph for both the above two targets. Dorylus [49] also uses an edgecut algorithm [92] for workload balance. DistGNN [47] aims at the two targets too. However, it uses a vertex-cut based graph partition technique instead. This means distributing the edges among the partitions. Thus, each edge exists in only one partition, while a vertex can reside in multiple partitions. Any updates to such vertex must be synchronized to its replicas in other partitions. FlexGraph [37] partitions the graph in the manner of edge-cut too. Besides, it learns a cost function to estimate the training cost for the given GNN model. Using the estimated training cost, FlexGraph migrates the workload from overloaded partitions to underloaded ones to pursue workload balance.

        DGCL [46] 使用 METIS 库 [91] 为上述两个目标划分图形。 Dorylus [49] 还使用边缘切割算法 [92] 来实现工作负载平衡。 DistGNN [47] 也针对这两个目标。但是,它改用基于顶点切割的图形分区技术。这意味着在分区之间分布边缘。因此,每条边只存在于一个分区中,而一个顶点可以存在于多个分区中。对此类顶点的任何更新都必须同步到其在其他分区中的副本。 FlexGraph [37]也以边切的方式对图进行划分。此外,它还学习了一个成本函数来估计给定 GNN 模型的训练成本。使用估计的训练成本,FlexGraph 将工作负载从过载分区迁移到负载不足的分区以追求工作负载平衡。
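
        下面给出一个极简的示意(并非METIS等真实分区算法):用顶点编号取模把图切成n个分区,并统计各分区大小和被切割的边数,用来直观体会“分区大小相近”与“最小化边切割”这两个目标:

```python
from collections import Counter

def naive_partition(num_vertices, edges, n_parts):
    """按顶点编号取模的朴素分区,仅用于说明;真实系统常用METIS等库来同时优化两个目标。"""
    part_of = {v: v % n_parts for v in range(num_vertices)}
    sizes = Counter(part_of.values())                   # 各分区的顶点数(负载的粗略指标)
    edge_cuts = sum(1 for u, v in edges if part_of[u] != part_of[v])  # 被切割的边数
    return part_of, sizes, edge_cuts

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
part_of, sizes, cuts = naive_partition(4, edges, n_parts=2)
print(sizes, cuts)   # 边切割越多,Aggregation步骤中跨节点的通信开销就越大
```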

        2 Transmission Optimization. Transmission optimization refers to adjusting the transmission strategy between computing nodes to reduce communication overhead. Due to the irregular nature of communication in preset-workload-based execution, the demand for transmission optimization is even stronger than above.

        2、传输优化。传输优化是指调整计算节点之间的传输策略以减少通信开销。由于基于预设工作量的执行中通信的不规则性,对传输优化的需求比上述更强烈

        DGCL [46] provides a general and efficient communication library for distributed GNN training. It tailors the shortest path spanning tree algorithm to transmission planning, which jointly considers fully utilizing fast links, avoiding contention, and balancing workloads on different links.

        DGCL [46] 为分布式GNN训练提供了一个通用且高效的通信库。它针对传输规划对最短路径生成树算法进行了定制,综合考虑充分利用快速链路、避免竞争以及平衡不同链路上的工作负载。

        Compared to traditional transmission planning, FlexGraph [37] takes a different approach to take advantage of the aggregation nature of the Aggregate function. FlexGraph [37] partially aggregates the features of neighboring vertices that co-locate at the same partition when possible, aiming to reduce the amount of data transmission and overlap partial aggregations and communications. When each computing node receives a neighbor information request from other nodes, it will first partially aggregate the neighbor information locally, and then send the partial result to the requesting computing node, instead of directly sending the initial neighbor information. The requesting computing node only needs to aggregate the received partial result with its local nodes to continue the computation. As a result, due to the reduction in data transmission and the overlap of computation and communication, the communication overhead is significantly reduced.

        与传统的传输规划相比,FlexGraph [37] 采用不同的方法来利用聚合函数的聚合特性。 FlexGraph [37] 在可能的情况下部分聚合了位于同一分区的相邻顶点的特征,旨在减少数据传输量重叠部分聚合和通信。当每个计算节点收到来自其他节点的邻居信息请求时,它会首先在本地部分聚合邻居信息,然后将部分结果发送给请求计算节点,而不是直接发送初始邻居信息。请求计算节点只需要将接收到的部分结果与其本地节点进行聚合即可继续计算。结果,由于数据传输的减少以及计算和通信的重叠,通信开销显着减少。
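
        To make the idea concrete, the following is a minimal sketch (not FlexGraph's actual implementation) of answering a remote neighbor request with partial aggregates instead of raw features, assuming a sum-style Aggregate function; `serve_neighbor_request` and `finish_aggregation` are illustrative helper names.

```python
import torch

def serve_neighbor_request(local_features, requested_groups):
    """Answer a remote neighbor-information request with partial aggregates.

    local_features:   (num_local_vertices, feat_dim) tensor held by this node.
    requested_groups: dict mapping a remote destination vertex id to the list
                      of *local* neighbor indices whose features it needs.

    Instead of shipping every raw neighbor feature, one partially aggregated
    vector per destination vertex is shipped (assuming a sum Aggregate).
    """
    partial_results = {}
    for dst_vertex, local_neighbor_idx in requested_groups.items():
        idx = torch.tensor(local_neighbor_idx, dtype=torch.long)
        # One feat_dim vector is sent instead of len(idx) feature vectors.
        partial_results[dst_vertex] = local_features[idx].sum(dim=0)
    return partial_results

def finish_aggregation(local_partial, remote_partials):
    """The requesting node merges its own partial sum with the received ones."""
    out = local_partial.clone()
    for p in remote_partials:
        out += p
    return out
```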

        3 Delayed Aggregation. By allowing the computing nodes to utilize previously transmitted (stale) data in the Aggregation step, delayed aggregation can reduce the overhead of data transmission. Normally, there is no scope for intra-epoch overlap due to the dependence between consecutive phases in an epoch. Delayed aggregation solves this problem by overlapping computation and communication, allowing the model to use the previously transmitted data. However, we need to point out that in order to ensure convergence and final accuracy, delayed aggregation is mainly based on bounded asynchronous training [49].

        3、延迟聚合。通过允许计算节点在聚合步骤中使用旧的传输数据,延迟聚合可以减少数据传输的开销。通常,由于一个epoch中连续阶段之间的依赖性,没有epoch内重叠的范围。延迟聚合解决了重叠计算通信的问题,允许模型使用之前传输的数据。但是,需要指出的是,为了保证收敛性和最终的准确性,延迟聚合主要基于有界异步训练[49]。

        Dorylus [49] uses bounded asynchronous training at two synchronization points: the update of weight parameters in backward propagation, and the neighbor aggregation in the Aggregation step. Bounded asynchronous training of GNNs is based on bounded staleness [93]–[96], an effective technique for mitigating the convergence problem by employing lightweight synchronization. The key policy is to restrict the number of iterations between the fastest worker and the slowest worker to not exceed a user-specified staleness threshold s, where s is a natural number. As long as the policy is not violated, there is no waiting time among workers.

        Dorylus [49] 在两个同步点上使用有界异步训练反向传播中权重参数的更新,以及聚合步骤中的邻居聚合。 GNN 的有界异步训练基于有界陈旧性[93]-[96],这是一种通过采用轻量级同步来缓解收敛问题的有效技术。关键策略是限制最快的worker和最慢的worker之间的迭代次数不超过用户指定的陈旧阈值 s,其中 s 是自然数。只要不违反政策,worker没有等待时间。
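
        A minimal sketch of the bounded staleness policy follows; it is illustrative only (class and method names are assumptions, not Dorylus's API). A worker blocks only when it would run more than s iterations ahead of the slowest worker; with s = 0 this degenerates to fully synchronous training.

```python
import threading

class BoundedStalenessGate:
    """Bounded staleness: a worker may start iteration t only if every worker
    has finished at least iteration t - 1 - s, where s is the staleness bound."""

    def __init__(self, num_workers, staleness):
        self.progress = [0] * num_workers   # last finished iteration per worker
        self.staleness = staleness
        self.cond = threading.Condition()

    def finish_iteration(self, worker_id):
        with self.cond:
            self.progress[worker_id] += 1
            self.cond.notify_all()

    def wait_to_start(self, worker_id, next_iteration):
        # Block only when this worker would get too far ahead of the slowest one.
        with self.cond:
            while next_iteration - 1 - min(self.progress) > self.staleness:
                self.cond.wait()
```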

        DistGNN [47] proposes the delayed remote partial aggregates (DRPA) algorithm to overlap communication with local computation, which is an inter-epoch computation-communication overlap. In the algorithm, the set of vertices that may be queried by other computing nodes is partitioned into r subsets. In each epoch, only the data of one subset is transmitted. The transmitted data is not required to be received in the same epoch, but after r epochs. This means that the computing nodes do not use the latest global data of these vertices, but their locally available data. This algorithm allows communication to overlap with more of the computational process, thereby reducing communication overhead.

        DistGNN [47] 提出了延迟远程部分聚合(DRPA)算法,将通信与本地计算重叠,即跨epoch计算通信重叠。在该算法中,将可能被其他计算节点查询的顶点集划分为r个子集。对于每个epoch计算,只传输一个子集的数据。传输的数据不需要在这个epoch 接收,而是在r个epoch 之后接收。这意味着计算节点不使用最新的全局顶点数据,而是本地现有的顶点数据。该算法允许通信与更多计算过程重叠,从而减少通信开销。
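
        The schedule can be illustrated with a small single-process sketch (not DistGNN's implementation): boundary vertices are split into r subsets, each epoch only one subset's partial aggregates are sent, and the data sent at epoch e is only consumed r epochs later; `compute_partial_aggregates` is an assumed callable.

```python
from collections import deque

def drpa_subsets(boundary_vertices, r):
    """Split the vertices that remote nodes may query into r subsets;
    epoch e transmits only subset e mod r."""
    return [boundary_vertices[i::r] for i in range(r)]

class DelayedRemoteAggregates:
    """Each epoch, 'send' one subset's partial aggregates and consume the data
    sent r epochs earlier; in between, stale local values are used, so the
    transfer overlaps with r epochs of computation."""

    def __init__(self, r):
        self.r = r
        self.in_flight = deque()            # messages still "on the wire"

    def step(self, epoch, subsets, compute_partial_aggregates):
        outgoing = compute_partial_aggregates(subsets[epoch % self.r])
        self.in_flight.append(outgoing)     # non-blocking send in a real system
        if len(self.in_flight) > self.r:    # data sent r epochs ago is now usable
            return self.in_flight.popleft()
        return None                         # fall back to locally cached values
```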

        4 Activation Rematerialization. Activation rematerialization uses data retransmission and recomputation during the computation process to reduce the memory pressure of computing nodes caused by intermediate results or data.

        4、激活再生成。激活再生成使用数据重新传输重新计算过程减少中间结果或数据对计算节点造成的内存压力

        In preset-workload-based execution, the graph data is stored in each worker in a distributed manner: if there are n workers, then each worker only needs to store 1/n of raw data initially. However, during the computation, each worker needs to store the information received from other workers. In forward propagation, each worker needs to query other workers to obtain the neighbor information of its local vertices, and the information needs to be stored for later use in backward propagation. As a result, the actual data stored by each worker during the computation is much larger than its initial size. In addition, its size is difficult to estimate and may lead to memory overflow problems.

        在preset-workload-based execution中,图数据分布式存储在每个worker中:如果有n个worker,那么每个worker最初只需要存储1/n的原始数据。然而,在计算过程中,每个worker都需要存储从其他worker那里收到的信息。在前向传播中,每个worker需要查询其他worker以获得其局部顶点的邻居信息,并且需要存储这些信息以备后向传播使用。结果,每个worker在计算过程中存储的实际数据远大于其初始大小。此外,它的大小很难估计并且可能导致内存溢出问题。

        Activation rematerialization, a widely-applied and mature technology in DNN training, solves the problem of storing all activations during forward propagation. Activations are the outputs of each layer of a neural network, which in GNNs means the representation of each vertex at each layer. Its idea is to recompute the activations or load them directly from disk during backward propagation to reduce memory pressure [97], [98].

        Activation rematerialization 是 DNN 中应用广泛且成熟的技术,它解决了前向传播过程中需要存储所有激活的问题。激活是神经网络每一层的输出,即 GNN 中每一层每个顶点的表示。它的想法是在反向传播期间重新计算激活或直接从磁盘加载激活,以减少内存压力 [97]、[98]。

        SAR [50] draws on the idea of activation rematerialization and proposes sequential aggregation and rematerialization for distributed GNN training. The specific execution flow is as follows. In forward propagation, each computing node only receives activation from one other computing node at a time. After the aggregation operation is completed, the activation is removed immediately. Then the computing node receives the activation from the next computing node and continues the aggregation operation. This makes the activation of each vertex only exist in the computing node where it is located, and there will be no replicas. In backward propagation, the computation is also performed sequentially as above. Each computing node transmits activation sequentially to complete the computation. Through this method, memory will not overflow as long as the memory capacity of the computing node is larger than the size of two subgraphs. This allows SAR to scale to arbitrarily large graphs by simply adding more workers.

        SAR [50] 借鉴了激活再生成的思想,并提出了分布式 GNN 训练的顺序聚合和再生成。具体执行流程如下。在前向传播中,每个计算节点一次只接收来自另一个计算节点的激活。聚合操作完成后,立即移除激活。然后计算节点收到来自下一个计算节点的激活并继续聚合操作。这使得每个顶点的激活只存在于它所在的计算节点中,不会有副本。在后向传播中,计算也是按上述顺序进行的。每个计算节点依次传输激活以完成计算。通过这种方式,只要计算节点的内存容量大于两个子图的大小,内存就不会溢出。这允许 SAR 通过简单地添加更多的worker来扩展到任意大的图
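
        The forward pass of this sequential scheme can be sketched as follows (an illustration of the idea only, assuming a sum Aggregate; `local_part.self_contribution()` and `fetch_remote_activations` are hypothetical helpers): at any moment, at most one remote partition's activations are resident, and they are dropped as soon as they are folded into the running aggregate.

```python
import torch

def sequential_aggregate(local_part, peers, fetch_remote_activations):
    """Sequential aggregation sketch: fetch activations from one peer at a
    time, fold them into the running aggregate, and discard them immediately.

    peers:                     ids of the other computing nodes
    fetch_remote_activations:  callable peer_id -> (num_local_vertices, feat_dim)
                               tensor of that peer's contribution to this
                               node's vertices (assumed helper).
    """
    aggregate = local_part.self_contribution()     # start from local edges
    for peer in peers:
        remote = fetch_remote_activations(peer)    # receive from ONE peer only
        aggregate += remote                        # fold it in
        del remote                                 # no replica is kept; it is
                                                   # rematerialized later if needed
    return aggregate
```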

V. DISTRIBUTED MINI-BATCH TRAINING

        This section describes GNN distributed mini-batch training in detail. Our taxonomy classifies it into two categories according to whether the sampling and model computation are decoupled, namely individual-sample-based execution and joint-sample-based execution, as illustrated in Fig. 8 (a).

        本节详细介绍 GNN 分布式小批量训练。我们的分类法将其分为两类,根据采样和模型计算是否解耦来划分,即基于单独样本的执行基于联合样本的执行,如图8(a)所示。

A. Individual-sample-based Execution

        The individual-sample-based execution of distributed mini-batch training is illustrated in Fig. 8 (b). Its workflow, computational pattern, communication pattern, and optimization techniques are introduced in detail as follows.

        分布式小批量训练的基于单独样本的执行如图 8(b)所示。其工作流程、计算模式、通信模式和优化技术详细介绍如下。

        1) Workflow: The individual-sample-based execution involves multiple samplers and workers so that it decouples sampling from the model computation. The sampler first samples the graph data to generate a mini-batch, and then sends the generated mini-batch to the workers. The worker performs the computation of the mini-batch and conducts gradient synchronization with other workers to update the model parameters. By providing enough computing resources for the samplers to prepare mini-batches for the workers, the computation can be performed without stalls.

        1)工作流程:基于单个样本的执行涉及多个采样器和工作器,因此它将采样与模型计算分离。 sampler首先对图数据进行采样生成一个mini-batch,然后将生成的mini-batch发送给workers。 worker 执行 mini-batch的计算并与其他 worker 进行梯度同步以更新模型参数。通过为采样器提供足够的计算资源来为worker准备小批量,可以在没有停顿的情况下执行计算。

        A more detailed description of the workflow follows. First, the samplers generate mini-batches by querying the graph structure. Since each worker requires one mini-batch per round of training, the samplers need to generate enough mini-batches in time. These mini-batches are transferred to the workers for subsequent computations. The workers perform forward propagation and backward propagation on their own received mini-batch and generate gradients. After that, gradient synchronization is conducted between workers to update model parameters.

        工作流程的更详细描述如下。首先,采样器通过查询图结构生成小批量。由于每个worker每轮训练需要一个小批量,因此采样器需要及时生成足够的小批量。这些小批量被转移到worker进行后续计算。 workers 对自己接收到的 mini-batch进行正向传播和反向传播,并生成梯度。之后,在worker之间进行梯度同步更新模型参数
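
        As a concrete illustration of the gradient synchronization step described above, the following minimal sketch averages gradients across workers with torch.distributed; it assumes the process group has already been initialized (e.g. via init_process_group) and is not tied to any particular framework in the survey.

```python
import torch
import torch.distributed as dist

def synchronize_gradients(model):
    """After backward propagation on its own mini-batch, every worker
    all-reduces (averages) its gradients so that all model replicas
    apply the same parameter update."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```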

        2) Computational Pattern: The computational pattern in the individual-sample-based execution is dominated by the computation of the sampler and the workers, which are responsible for the sampling phase and other computational phases respectively. The sampling phase relies on the graph structure, which leads to irregularities in the computation. These irregularities are reflected in the uncertainty of the data, including its amount and storage address, making it difficult to estimate the computational efficiency of the samplers, as well as the efficiency of mini-batch generation. After that, the sampler sends the mini-batches to the workers, which perform model computation on the batches. In contrast to sampling, the computational efficiency of workers is easy to estimate, since the amount of data and computation between mini-batches are similar. In addition, since the size of the GNN model is small, the gradient synchronization overhead is also small, which makes it generally not a bottleneck [38], [51]. As a result, the difference in the computational pattern between the samplers and workers makes the generation of mini-batch and the balance of consumption the focus of attention. The key optimization point is to accelerate the sampling process, so that it provides enough mini-batches in time to avoid stalls.

        2)计算模式:基于单独样本的执行中的计算模式由采样器和worker的计算主导,它们分别负责采样阶段其他计算阶段采样阶段依赖于图形结构,这导致计算中存在不规则性。这些不规则性反映在数据的不确定性上,包括其数量和存储地址,使得难以估计采样器的计算效率以及小批量生成的效率。之后,采样器将小批量发送给worker,worker对这些批量进行模型计算。与抽样相比,worker的计算效率很容易估计,因为小批量之间的数据量和计算量是相似的。此外,由于 GNN模型的尺寸较小,因此梯度同步开销也较小,这使得它通常不会成为瓶颈 [38]、[51]。因此,sampler 和worker 计算模式的差异使得 mini-batch 的生成和消耗的平衡成为关注的焦点。关键优化点是加快采样过程,使其及时提供足够的小批量,避免停顿

        3) Communication Pattern: The main communication of individual-sample-based execution is the mini-batch transmission between samplers and workers, which is regular but frequent. Since a mini-batch consists of a fixed number of target vertices and their limited neighbors, the amount of data is consistent and small. In addition, its transmission target is determined. This makes the transmission regular. Due to the small amount of data in a mini-batch, the amount of computation is also small, and it is easy to estimate the time cost required for computation. Due to the continuous consumption of mini-batches by workers, frequent mini-batch transfers are required to maintain a timely supply of mini-batches.

        3) 通信模式:individual-sample-based execution 的主要通信方式是 sampler 和 worker 之间的 mini-batch 传输,具有规则但频繁的特点。由于 mini-batch 由固定数量的目标顶点及其有限的邻居组成,因此数据量是一致的并且很小。此外,它的传输目标是确定的。这使得传输是规则的。由于 mini-batch 的数据量小,计算量也小,很容易预估计算所需的时间成本。由于 worker 不断消耗小批量,需要频繁的小批量传输以保持小批量的及时供应。

        4) Optimization Techniques: Here we introduce the optimization techniques used to generate mini-batch, balance workload, reduce transmission, and exploit parallelism for individual-sample-based execution in detail. We classify them into four categories: 1 Parallel Mini-batch Generation, 2 Dynamic Mini-batch Allocation, 3 Mini-batch Transmission Pipelining, and 4 Parallel Aggregation with Edge Partitioning. A summary for these categories is shown in Table VI.

        4)优化技术:在这里,我们详细介绍了用于生成小批量、平衡工作负载、减少传输以及利用并行性进行基于单个样本的执行的优化技术。我们将它们分为四类:1 并行小批量生成、2 动态小批量分配、3 小批量传输流水线 4 带边缘分区的并行聚合。这些类别的摘要显示在表VI 中。

        1 Parallel Mini-batch Generation. Parallel mini-batch generation refers to parallelizing the generation of mini-batches to reduce the time workers spend waiting for mini-batches. It aims to speed up the sampling so as to provide the workers with sufficient mini-batches in time. As mentioned before, the computational pattern of sampling is irregular, which is hard to perform efficiently on GPUs. Therefore, CPUs are generally used to perform sampling [51], [52]. A widely-used solution is to parallelize the generation of mini-batches by using the multi-thread capability of CPUs.

        1 并行小批量生成。Parallel mini-batch generation 是指并行化 mini-batch 的生成,以减少 worker 等待 mini-batch 的时间。它旨在加快采样速度,以便及时为 worker 提供足够的小批量。如前所述,采样的计算模式是不规则的,GPU 很难有效地执行这一点。因此,CPU 通常用于执行采样 [51]、[52]。一个广泛使用的解决方案是通过使用 CPU 的多线程设计来并行生成 mini-batches。

        SALIENT [56] parallelizes mini-batch generation by using the multi-threading capability of CPUs. It implements mini-batch generation end-to-end with C++ threads rather than Python threads to avoid Python's global interpreter lock, and thus improves the performance of mini-batch generation.

        SALIENT [56] 通过使用 CPU 的多线程技术并行化小批量生成。它使用 C++ 线程(而不是 Python 线程)端到端地实现小批量生成,以避免 Python 的全局解释器锁,从而提高 mini-batch 生成的性能。
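
        SALIENT's implementation relies on C++ threads; the sketch below conveys the same producer-side idea in plain Python using sampler processes (which sidestep the GIL) that push mini-batches into a shared queue. The `sample_minibatch` callable, the queue size, and the batch counts are assumptions for illustration.

```python
import multiprocessing as mp

def sampler_loop(sample_minibatch, batch_queue, num_batches):
    """One sampler process: repeatedly build mini-batches and push them to a
    shared queue for the workers to consume."""
    for _ in range(num_batches):
        batch_queue.put(sample_minibatch())

def start_parallel_samplers(sample_minibatch, num_samplers, batches_per_sampler):
    """Launch several sampler processes so that mini-batch generation keeps up
    with the workers. sample_minibatch must be picklable (module-level)."""
    queue = mp.Queue(maxsize=4 * num_samplers)   # bounded queue caps memory use
    procs = [
        mp.Process(target=sampler_loop,
                   args=(sample_minibatch, queue, batches_per_sampler))
        for _ in range(num_samplers)
    ]
    for p in procs:
        p.start()
    return queue, procs
```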

        AGL [36], on the other hand, generates the k-hop neighborhoods of vertices through preprocessing to simplify and speed up the generation of mini-batches. Here, the k-hop neighborhood of a given vertex consists of the vertices that can be reached from it within at most k edges. AGL proposes a distributed pipeline to generate the k-hop neighborhood of each vertex based on message passing, and implements it with MapReduce infrastructure [99]. The generated k-hop neighborhood information is stored in a distributed file system. Since it is only necessary to collect the k-hop neighborhood information of the target vertices when a mini-batch is required, such a design greatly accelerates the sampling process. In order to limit the size of the k-hop neighborhoods, AGL implements a set of sampling strategies to selectively collect neighbors.

        另一方面,AGL [36] 通过预处理生成顶点的 k 跳邻域,以简化和加速小批量的生成。这里,k 跳邻域是给定顶点在最大 k 条边内可以访问的顶点。 AGL 提出了一种分布式管道,可以根据消息传递生成每个顶点的 k 跳邻域,并使用 MapReduce 基础设施实现它 [99]。生成的 k-hop 邻域信息存储在分布式文件系统中。由于只需要在需要 mini-batch 时收集目标顶点的 k-hop 邻域信息,这样的设计大大加快了采样过程。为了限制 k-hop 邻域的大小,AGL 实现了一组采样策略来选择性地收集邻域

        2 Dynamic Mini-batch Allocation. In dynamic mini-batch allocation, mini-batches are dynamically allocated to the workers to ensure that each computing node has a timely mini-batch supply and to alleviate workload imbalance. Usually, a well-designed allocator is used to collect the mini-batches generated by the samplers, and then supply them in time according to the needs of the workers. In this way, the performance of sampling is determined by the efficiency of all samplers, instead of the slowest sampler. This effectively avoids the situation where the computation is stalled due to the long sampling time of some slow samplers.

        2 动态小批量分配。在动态mini-batch分配中,将mini-batch动态分配给worker,保证每个计算节点及时有mini-batch供应,缓解工作负载不平衡。通常,设计良好的分配器用于收集采样器生成的 mini-batch,然后根据 worker 的需要及时供应 mini-batch。这样,采样的性能由所有采样器的效率决定,而不是由最慢的采样器决定。这样可以有效避免一些慢采样器采样时间过长导致计算停顿的情况。

        Meanwhile, dynamic mini-batch allocation helps to alleviate workload imbalance. In traditional static mini-batch allocation, each worker has one or several dedicated samplers, which are responsible for sampling and transferring mini-batches to it. Due to the irregular computational pattern of sampling, the computing time of a sampler is difficult to estimate, which may lead to serious workload imbalance under static allocation, as the workers corresponding to slow samplers have to wait for their mini-batches. In contrast, dynamic mini-batch allocation uniformly manages the workload and dispatches mini-batches to workers in time, thereby avoiding workload imbalance and achieving better performance.

        同时,动态小批量分配有助于缓解工作负载不平衡。传统上,mini-batch静态分配是每个worker有一个或几个对应的采样器,它的采样器负责采样并将mini-batch传输给它。由于采样的不规则计算模式,采样器的计算时间难以估计,并且可能导致静态分配中严重的工作负载不平衡,因为一些慢速采样器对应的worker必须等待他们的小批量.相比之下,动态小批量分配统一管理工作负载并及时将小批量分配给worker,从而避免工作负载不平衡并获得更好的性能。

        SALIENT [56] implements dynamic mini-batch allocation by using a lock-free input queue. The CPUs responsible for sampling and the GPUs responsible for model computation are not statically paired. Worker identifiers are stored sequentially in the queue. Each mini-batch generated by a CPU is assigned a destination worker according to the identifier taken from the queue during generation. After generation, the mini-batch is immediately sent to the corresponding worker according to the destination identifier. The lock-free input queue is a simple design that effectively avoids the problems faced by static allocation.

        SALIENT [56] 通过使用无锁输入队列实现动态小批量分配。负责采样的 CPU 和负责模型计算的 GPU 不是静态对应关系。每个 worker 的编号顺序存储在队列中。每个 CPU 生成的 mini-batch 都会根据生成时从队列中取出的编号分配一个目标 worker 编号。生成后,立即将 mini-batch 根据目标编号发送给对应的 worker。无锁输入队列是一种简单的设计,有效避免了静态分配所面临的问题。
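
        The allocation policy can be sketched as follows. SALIENT's queue is lock-free; for simplicity the sketch uses Python's standard thread-safe queue, and the `send_to_worker` transport callable is an assumption. The point is only that a fresh mini-batch goes to whichever worker becomes idle next, rather than to a statically bound partner.

```python
import queue

class DynamicAllocator:
    """Dynamic mini-batch allocation sketch: idle workers enqueue their id,
    and each freshly generated mini-batch is dispatched to whichever worker
    id comes out of the queue next."""

    def __init__(self, send_to_worker):
        self.idle_workers = queue.Queue()      # worker ids waiting for a batch
        self.send_to_worker = send_to_worker   # assumed transport callable

    def worker_ready(self, worker_id):
        """A worker announces that it can accept the next mini-batch."""
        self.idle_workers.put(worker_id)

    def batch_generated(self, batch):
        """A sampler hands over a finished mini-batch; it is routed to the
        next idle worker rather than to a fixed partner."""
        destination = self.idle_workers.get()  # blocks if all workers are busy
        self.send_to_worker(destination, batch)
```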

        3 Mini-batch Transmission Pipelining. Mini-batch transmission pipelining refers to pipelining the transmission of mini-batches and the model computation to reduce the overhead of data transmission.

        3 小批量传输流水线。小批量传输流水线是指对小批量传输和模型计算进行流水线化,以减少数据传输的开销

        Due to the decoupled nature of sampling and model computation in individual-sample-based execution, mini-batches need to be transferred from samplers to workers. The overhead of this transmission is high and can occupy up to 35% of the per-epoch time [56]. Fortunately, the mini-batch transmission and model computation do not require strictly sequential execution. Thus, mini-batch transmission and model computation can be pipelined to reduce the mini-batch transmission overhead. This reduces worker stalls caused by mini-batches not being supplied in time.

        由于基于单个样本的执行中采样和模型计算的解耦性质,需要将小批量从采样器转移到worker。传输的开销很高,甚至占据每个epoch时间的 35% [56]。幸运的是,小批量传输和模型计算不需要严格的顺序执行。因此,可以实现小批量传输和模型计算流水线,以减少小批量传输开销。这样就减少了因mini-batch供应不及时而导致工人停滞的情况发生。

        In SALIENT [56], the samplers are CPUs and the workers are GPUs, so mini-batches need to be transferred from CPUs to GPUs. To pipeline mini-batch transmission and model computation, it uses different GPU threads to deal with each of them respectively. AGL [36] uses the idea of pipelining to reduce the data transmission too.

        在 SALIENT [56] 中,采样器是 CPU,工作器是 GPU,因此小批量需要从 CPU 转移到 GPU。为了流水线处理小批量传输和模型计算,它使用不同的 GPU线程分别处理它们中的每一个。 AGL [36] 也使用流水线的思想来减少数据传输。
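
        A minimal PyTorch sketch of such CPU-to-GPU pipelining is shown below (an illustration of the general technique, not SALIENT's or AGL's code): the next batch is copied host-to-device on a separate CUDA stream with pinned memory while the current batch is being computed on. The loss is a placeholder and `batches` is assumed to yield CPU tensors.

```python
import torch

def train_with_prefetch(batches, model, optimizer, device="cuda"):
    """Overlap mini-batch transfer and model computation with a copy stream."""
    copy_stream = torch.cuda.Stream()

    def async_to_gpu(batch):
        feats, labels = batch
        with torch.cuda.stream(copy_stream):
            return (feats.pin_memory().to(device, non_blocking=True),
                    labels.pin_memory().to(device, non_blocking=True))

    cpu_batches = list(batches)
    next_gpu_batch = async_to_gpu(cpu_batches[0])
    for i in range(len(cpu_batches)):
        # Ensure the batch we are about to use has finished copying.
        torch.cuda.current_stream().wait_stream(copy_stream)
        feats, labels = next_gpu_batch
        # Kick off the copy of the following batch while this one is computed on.
        if i + 1 < len(cpu_batches):
            next_gpu_batch = async_to_gpu(cpu_batches[i + 1])
        optimizer.zero_grad()
        # Placeholder loss; assumes model output and labels are broadcast-compatible.
        loss = (model(feats) - labels.float()).pow(2).mean()
        loss.backward()
        optimizer.step()
```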

        4 Parallel Aggregation with Edge Partitioning. Parallel aggregation with edge partitioning refers to partitioning the edges in mini-batches according to their destination vertices and assigning each partition to an individual thread to accomplish aggregation in parallel.

        4 带边缘分区的并行聚合。带边分区的并行聚合是指根据小批量的目标顶点对边进行分区,并将每个分区分配给单独的线程以并行完成聚合

        For the Aggregation operation of GNNs, the neighboring vertices' features need to be aggregated for each vertex, which means the computation is actually determined by the edges. AGL [36] proposes an edge partitioning technique to accomplish parallel aggregation according to this property. In each mini-batch, AGL partitions the edges into several partitions according to their destination vertices, ensuring that edges with the same destination vertex are in the same partition. These partitions are then handled by multiple threads independently. As there are no data conflicts between these threads, the Aggregation operation can be accomplished effectively in parallel. In addition, workload balance between these threads is guaranteed, because the mini-batch generated by sampling ensures that each vertex has a similar number of neighboring vertices, resulting in a similar number of edges per edge partition.

        对于 GNN 的聚合操作,需要为每个顶点聚合相邻顶点的特征。这意味着计算实际上是由边决定的。 AGL [36] 提出了边缘分区技术来根据此属性完成并行聚合。在每个mini-batch 中,AGL 根据其目标顶点将边划分为多个分区,以确保具有相同目标顶点的边位于同一分区中。然后这些分区由多个线程独立处理。由于这些线程之间没有数据冲突,聚合操作可以有效地并行完成。此外,保证了这些线程之间的工作负载平衡。这是因为采样生成的mini-batch 确保每个顶点都有相似数量的相邻顶点,从而导致每个边分区的边数量相似。
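
        The following sketch illustrates the idea (not AGL's implementation): edges are grouped by destination vertex so that each destination appears in exactly one partition, and the partitions are aggregated by separate threads with no write conflicts. A simple modulo rule stands in for AGL's partitioner.

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def parallel_aggregate(edges, features, num_vertices, num_threads=4):
    """Sum-aggregate source features into destination vertices in parallel.

    edges:    (num_edges, 2) LongTensor of (src, dst) pairs of a mini-batch
    features: (num_src_vertices, feat_dim) tensor of source features
    """
    out = torch.zeros(num_vertices, features.shape[1], dtype=features.dtype)
    # Group edges by destination: dst % num_threads keeps equal destinations together.
    partitions = [edges[edges[:, 1] % num_threads == t] for t in range(num_threads)]

    def aggregate_partition(part):
        if part.numel() == 0:
            return
        # Destinations of different partitions are disjoint, so threads write
        # to disjoint rows of `out` and there are no data conflicts.
        out.index_add_(0, part[:, 1], features[part[:, 0]])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(aggregate_partition, partitions))
    return out
```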

B. Joint-sample-based Execution

        The joint-sample-based execution of distributed mini-batch training is illustrated in Fig. 8 (c). Its workflow, computational pattern, communication pattern, and optimization techniques are introduced in detail as follows.

        分布式小批量训练的基于联合样本的执行如图 8(c)所示。其工作流程、计算模式、通信模式和优化技术详细介绍如下。

         1) Workflow: In the joint-sample-based execution, multiple collaborative workers are used to perform training. The graph is split into subgraphs through the partition in the preprocessing. Each worker holds one subgraph and a replica of the model parameters. The workers sample their own subgraphs to generate mini-batches and perform forward propagation as well as backward propagation on the mini-batches to obtain gradients. Then they update the model synchronously through communication with other workers.

        1)工作流程:在基于联合样本的执行中,使用多个协作worker来执行训练。该图在预处理过程中通过分区被分割成子图每个worker持有一个子图和模型参数的副本。worker对自己的子图进行采样以生成小批量,并对小批量进行前向传播和反向传播以获得梯度。然后他们通过与其他worker的交流来同步更新模型。

        A more detailed description of the workflow follows. First, in order to make the computation of each worker more independent, a well-designed graph partition algorithm is used to split the original graph into multiple subgraphs in the preprocessing step. Each subgraph is assigned to a worker, and the worker samples its local subgraph, which means the target vertices for sampling are chosen only from its own subgraph. However, the worker may need to query other workers for the neighborhoods of the target vertices during sampling. Second, the worker performs model computation on the mini-batch generated by itself. Through forward and backward propagation, each worker produces gradients and conducts gradient synchronization together with the others to update the model parameters. Note that the sampling and model computation are conducted on a single worker. This feature makes the computation of each computing node more independent, which greatly reduces the demand for data transmission.

        工作流程的更详细描述如下。首先,为了使每个worker的计算更加独立,在预处理中使用设计良好的图划分算法将原始图划分为多个子图每个子图都分配给一个worker,worker对其局部子图进行采样,这意味着选择采样的目标顶点仅从其自己的子图中选择。然而,worker可能需要向其他worker查询采样中目标顶点的邻域。其次,worker 对自己生成的 mini-batch 进行模型计算。通过前向和反向传播,每个worker产生梯度,一起进行梯度同步,更新模型参数。请注意,采样和模型计算是在单个 worker 上进行的。这一特点使得每个计算节点的计算更加独立,大大降低了对数据传输的需求。

        2) Computational Pattern: In the joint-sample-based execution, the computation of each computing node is relatively independent. The computation of each computing node includes two parts: sampling and model computation. In addition, gradient synchronization between computing nodes is required for updating model parameters. In the sampling phase, each computing node mainly uses its own local data, and there may be a need to access remote data for neighbor information of vertices. The mini-batch generated by sampling is directly consumed by itself for model computation. For such a computing pattern, the main concern is how to make the computation of computing nodes more independent, so as to obtain better performance.

        2)计算模式:在基于联合样本的执行中,每个计算节点的计算是相对独立的。每个计算节点的计算包括两部分:采样和模型计算。此外,更新模型参数需要计算节点之间的梯度同步。在采样阶段,每个计算节点主要使用自己的本地数据,可能需要访问远程数据获取顶点的邻居信息。采样产生的mini-batch直接被自身消耗,用于模型计算。对于这样的计算模式,主要关心的是如何让计算节点的计算更加独立,从而获得更好的性能。

        3) Communication Pattern: There are three communication requirements in joint-sample-based execution, each with a different communication pattern. The first exists in the sampling phase, and its communication pattern is irregular. In the joint-sample-based execution, each computing node selects the target vertices of the mini-batch from its own subgraph. Then the computing node needs to sample the neighbors of these target vertices. Since the graph is partitioned, it may be necessary to query other computing nodes to obtain the neighbor information of these target vertices [38]. As the communication is determined by the graph structure and the graph itself has an irregular connection pattern, its communication pattern is irregular.

        3)通信模式:基于联合样本的执行有三种通信需求,每一种都有不同的通信模式。第一个存在于采样阶段,其通信模式是不规则的。在基于联合样本的执行中,每个计算节点从自己的子图中选择小批量的目标顶点。然后计算节点需要对这些目标顶点的邻居进行采样。由于图是分区的,可能需要查询其他计算节点以获得这些目标顶点的邻居信息[38]。由于通信是由图结构决定的,而图本身具有不规则的连接模式,因此其通信模式是不规则的。

        The second communication requirement is in the gradient synchronization phase. After sampling and model computation, the computing nodes need to synchronize to update the model parameters. Unlike DNNs, the model parameters of GNNs are much smaller, since GNNs have few layers and share weights across all vertices. As a result, the gradient synchronization phase is expected to occupy a small part of the execution time [38], [51]. However, this synchronization acts as a barrier: it requires all computing nodes to complete their computation before it can start, so it is highly sensitive to workload imbalance. The low performance of a single computing node will cause other computing nodes to wait, which can result in a large time overhead for this part of the communication [83].

        第二个通信需求是在梯度同步阶段。在采样和模型计算之后,计算节点需要同步更新模型参数。与 DNN 不同,GNN 的模型参数的大小要小得多,因为GNN 的层数很少并且在所有顶点之间共享权重。结果,预计梯度同步阶段占据执行时间的一小部分[38],[51]。但同步具有栅栏的性质,需要所有计算节点完成计算才能开始,因此对工作负载不平衡高度敏感。单个计算节点的低性能会导致其他计算节点等待,这将导致这部分通信的时间开销很大[83]。

        The third communication requirement only arises when the sampling and model computation of a single worker are performed on different computing components, such as CPUs and GPUs on a single server [51], [52]. Its communication pattern is regular but with high redundancy. The existence of the same vertex in different mini-batches is called inter-mini-batch redundancy, which leads to wasted transmission. This redundancy is exacerbated by the fact that each computing node in joint-sample-based execution mainly samples its own subgraph: compared to sampling on the entire graph, sampling on the same subgraph exhibits a higher probability of sampling the same vertices for different mini-batches [51]. As a result, reducing transmission overhead by eliminating this redundancy is an effective optimization point.

        第三个通信要求仅在单个 worker 的采样和模型计算在不同的计算组件(例如单个服务器上的 CPU 和 GPU)上执行时出现 [51],[52]。它的通信模式是规则的,但具有很高的冗余度。在不同的小批量中存在相同的顶点称为小批量间冗余,这会导致传输浪费。基于联合样本的执行中的每个计算节点主要在其自己的子图上进行采样这一事实加剧了这种冗余:与在整个图上的采样相比,在同一子图上的采样对于不同的小批量表现出更高的采样到相同顶点的概率 [51]。因此,通过消除这种冗余来减少传输开销是一个有效的优化点。

        4) Optimization Techniques: Here we introduce the optimization techniques used to reduce transmission overhead and waiting overhead for joint-sample-based execution in detail. We categorize them into four categories, including 1 Locality-aware Partition, 2 Partition with Overlap, 3 Independent Execution with Refinement, and 4 Frequently-used Data Caching. A summary of these categories is shown in Table VII.

        4)优化技术:这里我们详细介绍了用于减少基于联合样本执行的传输开销和等待开销的优化技术。我们将它们分为四类,包括 1 Locality-aware Partition,2 Partition with Overlap,3 Independent Execution with Refinement,以及 4 Frequently-used Data Caching。这些类别的摘要显示在表 VII 中。

        1 Locality-aware Partition. Locality-aware partition refers to partitioning the graph into subgraphs with good locality, that is, vertices and their neighbors have a high probability to be in the same subgraph, so most of the data required for sampling is local to the computing node. This partition is conducted in the preprocessing phase.

        1 位置感知分区。Locality-aware partition是指将图划分为局部性较好的子图,即顶点及其邻居极有可能在同一个子图中,因此采样所需的大部分数据都在计算节点本地。这种划分是在预处理阶段进行的。

        Locality-aware partition aims to make the computation of the nodes more independent by reducing the communication between them. According to the workflow of mini-batch training, the model computation is restricted in using the minibatch data and does not involve access to the entire raw graph data. That means the access to the raw graph data is almost completely in the sampling phase. By focusing on the locality of the graph, each vertex and its neighbors are clustered into a subgraph as much as possible. This makes the workers mainly access their own subgraph during the sampling phase, reducing the remote queries of neighboring vertices, and thus improving the independence of each worker's computation.

        局部感知分区旨在通过减少节点之间的通信来使节点的计算更加独立。根据小批量训练的工作流程,模型计算仅限于使用小批量数据,不涉及访问整个原始图数据。这意味着对原始图形数据的访问几乎完全处于采样阶段。通过关注图的局部,每个顶点及其邻居都尽可能聚类成一个子图。这使得worker在采样阶段主要访问自己的子图,减少了对相邻顶点的远程查询,从而提高了每个worker计算的独立性。

        The graph partition here is very different from the one of distributed full-batch training mentioned earlier in the paper, which is mainly reflected in its purpose. The graph partition in the preset-workload-based execution of distributed fullbatch training focuses more on workload balance due to its significant impact on performance. Because a large amount of information of the subgraph is related to the execution time, it is necessary to take various factors into account to guide the graph partition, such as the numbers of vertices and edges. In contrast, workload balance is not the critical issue for graph partition in distributed mini-batch training. Because the computation amount of the subgraph is mainly determined by the number of target vertices contained in it, it is only necessary to ensure that the number of target vertices in each subgraph tends to be consistent, which is simple but effective. Such characteristics cause the two graph partitions to be quite different, which in turn leads to different techniques used.

        这里的图划分与本文前面提到的分布式全批训练有很大的不同,主要体现在它的用途上。基于预设工作负载的分布式全批训练执行中的图形分区由于其对性能的显着影响而更侧重于工作负载平衡。由于子图的大量信息与执行时间有关,因此需要考虑各种因素来指导图的划分,例如顶点和边的数量。相反,工作负载平衡不是分布式小批量训练中图分区的关键问题。由于子图的计算量主要由其包含的目标顶点个数决定,因此只需保证每个子图中目标顶点个数趋于一致即可,简单但有效。这些特性导致两个图分区有很大不同,这反过来又导致使用不同的技术。

        To implement locality-aware partition, PaGraph [51] proposes a formula to guide the graph partition. Its partition algorithm scans the whole target vertex set, and iteratively assigns the scanned vertex to one of K partitions according to the score computed from the formula:

        为了实现局部感知分区,PaGraph [51] 提出了一个公式来指导图分区。它的分区算法扫描整个目标顶点集,并根据从公式计算的分数迭代地将扫描的顶点分配给 K 个分区之一:
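
        The formula itself is not reproduced in this post (as noted at the top, 公式请看原文); a hedged reconstruction from the symbol definitions below, consistent with the PaGraph paper, is:

```latex
\mathrm{score}(v_t, i) \;=\; \bigl|\, TV_i \cap IN(v_t) \,\bigr| \cdot \frac{TV_{avg} - |TV_i|}{|PV_i|}
```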

        where TV_i denotes the train vertex set already assigned to the i-th partition, IN(v_t) denotes the L-hop (L is the number of layers of the GNN model) incoming neighbor set of the target vertex v_t, and PV_i denotes the total number of vertices in the i-th partition. TV_avg denotes the expected number of target vertices in the final i-th partition, which is typically set as the total number of target vertices |TV| divided by the partition number K for workload balance. This formula fully considers the distribution of the vertices' neighbors when assigning them to the subgraphs, thus making good use of the locality of the graph.

        其中 TV_i 表示已经分配给第 i 个分区的训练顶点集,IN(v_t) 表示目标顶点 v_t 的 L 跳(L 为 GNN 模型的层数)传入邻居集,PV_i 表示第 i 个分区中的顶点总数。TV_avg 表示最终第 i 个分区中目标顶点的预期数量,通常设置为目标顶点总数 |TV| 除以分区数 K,以实现工作负载平衡。该公式在将顶点划分到每个子图时充分考虑了顶点邻居的分布,从而充分利用了图的局部性。
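
        A small Python sketch of the greedy assignment loop, based on the score above, may help; it is illustrative only (function and variable names are assumptions), and `in_neighbors` is an assumed precomputed map of L-hop in-neighbors.

```python
def locality_aware_partition(train_vertices, in_neighbors, num_partitions):
    """Greedily assign each target (train) vertex to the highest-scoring partition."""
    tv = [set() for _ in range(num_partitions)]    # train vertices per partition
    pv = [set() for _ in range(num_partitions)]    # all vertices per partition
    tv_avg = len(train_vertices) / num_partitions  # expected train vertices per partition

    for v in train_vertices:
        neigh = in_neighbors[v]
        scores = [
            len(tv[i] & neigh) * (tv_avg - len(tv[i])) / max(len(pv[i]), 1)
            for i in range(num_partitions)
        ]
        best = max(range(num_partitions), key=lambda i: scores[i])
        tv[best].add(v)
        pv[best].update(neigh | {v})               # the partition also stores the neighbors
    return tv, pv
```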

        Other research mainly uses existing graph partition algorithms to harvest the locality of the graph. 2PGraph [52] uses a cluster-based graph partition algorithm [100], [101]. DistDGL [38] takes advantage of the multi-constraint mechanism in METIS [100] for graph partition. AliGraph [35] implements four graph partition algorithms in its system and suggests that the four algorithms are suitable for different circumstances: METIS [91] for processing sparse graphs, vertex-cut and edge-cut partitions [102] for processing dense graphs, 2-D partition [103] for systems with a fixed number of workers, and a streaming-style partition strategy [104] for graphs with frequent edge updates.

        其他研究主要使用现有的图分区算法来获取图的局部性。2PGraph [52] 在其图分区中使用基于集群的图分区算法 [100]、[101]。DistDGL [38] 利用 METIS [100] 中的多约束机制进行图分区。AliGraph [35] 在其系统中实现了四种图划分算法,并建议这四种算法适用于不同的情况。它们是用于处理稀疏图的 METIS [91],用于处理密集图的顶点和边缘切割分区 [102],用于具有固定数量 worker 的系统的二维分区 [103],以及用于经常更新边的图的流式分区策略 [104]。

        2 Partition with Overlap. Partition with overlap refers to duplicating the neighboring vertices of the target vertex when partitioning the graph, thereby reducing or even completely eliminating the data transmission between computing nodes in the sampling phase.

        2 重叠分区。重叠分区是指在对图进行分区时复制目标顶点的相邻顶点,从而减少甚至完全消除采样阶段计算节点之间的数据传输。

        For comparison, we first review the general graph partitioning method used in traditional graph analytics, i.e., partition with non-overlap, which means no duplication is used. An example of partition with non-overlap is shown in Fig. 9 (a). The original graph is split into two subgraphs, denoted as A and B. The vertices held by each subgraph are disjoint. This has the benefit of linearly reducing the memory capacity requirement when training on large graphs. In the case of n workers, each worker only needs to hold 1/n of the original graph data. However, partition with non-overlap causes some vertices' neighbors to be stored on remote workers, resulting in the need for remote queries during the sampling phase. Intensive transmissions would lead to inefficiencies in the parallel execution of distributed training.

        为了进行比较,我们首先回顾了传统图分析中使用的一般图分区方法,即非重叠分区,这意味着不使用重复方法。图 9(a)展示了一个非重叠分区的例子。原始图被分成两个子图,表示为 A 和 B。每个子图所拥有的顶点是不相交的。这具有在训练大型图时线性减少内存容量需求的好处。在n个worker的情况下,每个worker只需要持有原始图数据的1/n。但是,非重叠分区会导致某些顶点的邻居存储在远程worker中,从而导致在采样阶段需要进行远程查询。密集传输会导致分布式训练的并行执行效率低下。

        To reduce the amount of transmission in the sampling phase, partition with overlap is proposed. An example of partition with overlap is illustrated in Fig. 9 (b), whose duplicate range is 1-hop neighbors. The original graph is split into two subgraphs with duplicated vertices. Vertices 2 and 7 are duplicated as they are 1-hop neighbors of target vertices in subgraph A. This method takes advantage of the characteristics of GNN mini-batch training. When a mini-batch is sampled, only the L-hop neighboring vertices of the target vertices are needed, where L is the number of GNN layers. Therefore, by duplicating the features of neighboring vertices, the transmission overhead in the sampling phase can be reduced. These duplicated vertices are regarded as mirror vertices, which do not participate in training as target vertices [102]. If the L-hop neighbors are fully duplicated, the transmission requirement during the sampling phase can be completely eliminated. Note that duplicating more vertices increases the memory capacity requirements of the computing nodes, so there is a trade-off in how many neighboring vertices are duplicated.

        为了减少采样阶段的传输量,提出了重叠分区。重叠分区的示例如图 9(b)所示,其重复范围是 1 跳邻居。原始图被分成两个具有重复顶点的子图。顶点 2 和 7 是重复的,因为它们是子图 A 中目标顶点的 1 跳相邻顶点。该方法利用了 GNN mini-batch训练的特点。当对 mini-batch 进行采样时,只需要目标顶点的 L-hop 相邻顶点,其中 L 是 GNN 层数。因此,通过复制其相邻顶点的特征,采样阶段的传输开销可以减少。这些重复的顶点被视为镜像顶点,它们不作为目标顶点参与训练[102]。如果 L-hop 邻居完全复制,则可以完全消除采样阶段的传输要求。请注意,复制更多顶点会导致计算节点的内存容量需求增加,因此需要权衡复制多少相邻顶点。
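
        A minimal sketch of computing which vertices to duplicate for one partition follows (an illustration of the general idea, not any specific system's code): starting from the partition's target vertices, every vertex reachable within L incoming hops but not owned by the partition becomes a mirror vertex whose features are copied locally.

```python
def compute_mirror_vertices(partition_targets, in_adjacency, num_hops):
    """Return (owned target vertices, mirror vertices to duplicate locally).

    in_adjacency: dict vertex -> iterable of incoming-neighbor ids (assumed input)
    """
    owned = set(partition_targets)
    frontier = set(partition_targets)
    mirrors = set()
    for _ in range(num_hops):              # expand hop by hop up to L hops
        next_frontier = set()
        for v in frontier:
            for u in in_adjacency.get(v, ()):
                if u not in owned and u not in mirrors:
                    mirrors.add(u)          # duplicated, but never a target vertex
                    next_frontier.add(u)
        frontier = next_frontier
    return owned, mirrors
```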

        Both PaGraph [51] and 2PGraph [52] use this method to make the computation of each worker more independent. DGCL [46] defines a replication factor, which is computed as the total number of vertices (including the original and mirror vertices) kept by all workers divided by the number of vertices in the original graph. In its experiments, for the dense Reddit dataset [13] with 16 GPU workers and a 2-hop duplicate range, the replication factor reaches 15. In contrast, for the sparser Web-Google dataset [105] with the same setting, the replication factor is much smaller, at 2.5. This suggests that sparse graphs are more suitable for the duplication method as they incur less memory capacity overhead.

        PaGraph [51]和2PGraph [52]都使用这种方法使每个worker的计算更加独立。 DGCL [46] 定义了一个复制因子,它是根据所有 worker 保存的顶点总数(包括原始顶点和镜像顶点)除以原始图中的顶点数计算得出的。在其实验中,对于具有 16 个 GPU worker和 2 跳重复范围的密集 Reddit 数据集 [13],复制因子达到 15。相比之下,对于具有相同设置的稀疏 Web-Google 数据集[105],复制因子小得多,为 2.5。它表明稀疏图更适合复制方法,因为它们的内存容量开销较小

        3 Independent Execution with Refinement. Independent execution with refinement refers to letting each worker compute and update the model parameters independently, and then periodically averaging the model parameters with an additional refinement operation, to release the potential of parallel computation while ensuring high accuracy.

        3 独立执行与改进。 Independent execution with refinement 是指让每个 worker 独立计算和更新模型参数,然后使用额外的 refine 操作定期对模型参数进行平均,以释放并行计算的潜力,同时确保高精度。

        Although the workers of joint-sample-based execution mainly perform computations on their own local subgraph, there are still interactions between them to advance the training process, including the sampling phase and the gradient synchronization phase. The limitations of these two phases cause each worker's execution to be affected by the state of other workers, resulting in idle wait time.

        虽然基于联合样本执行的worker主要在自己的局部子图上执行计算,但它们之间仍然存在交互以推进训练过程,包括采样阶段和梯度同步阶段。这两个阶段的局限性导致每个worker的执行都会受到其他worker状态的影响,造成空闲等待时间。

        In order to achieve more independent computation on each computing node, LLCG [53] proposes a method of independent execution with refinement. In its implementation, each worker first performs sampling and the corresponding model computation independently on its own subgraph. That is to say, the data of remote workers is not accessed at all, and each worker only updates its own model parameters locally. A parameter server then periodically collects the model parameters from each worker and averages them. Finally, the parameter server refines the averaged result, that is, it samples mini-batches on the whole graph and conducts model computation to update the model parameters of every worker. This refinement is needed because each worker lacks the structural information of the whole graph, which may lead to low accuracy. Through this method, the computation of each worker is more independent, while the refinement of the model parameters ensures the high accuracy of the GNN model.

        为了实现每个计算节点更独立的计算,LLCG [53]提出了一种独立执行和细化的方法。在其实现中,每个worker首先在自己的子图上独立进行采样和相应的模型计算。也就是说根本不访问远程worker的数据,每个worker只在本地更新自己的模型参数。然后参数服务器定期收集来自每个worker的模型参数,对参数进行平均运算。最后,参数服务器对平均结果进行细化,即在整个图上对mini-batch进行采样,进行模型计算,更新每个worker的模型参数。这是由于每个worker都缺少整张图的结构信息,可能会导致精度不高。通过这种方法,每个worker的计算更加独立,同时对模型参数的细化保证了GNN模型的高精度。
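
        The periodic server-side step can be sketched as follows (a sketch of the idea only, not LLCG's code); `global_refine_step` stands in for the server's mini-batch training on the whole graph and is an assumed callable, and the worker models are assumed to be replicas of the same architecture.

```python
import torch

def average_and_refine(worker_models, server_model, global_refine_step):
    """Periodically average locally trained replicas, refine on globally
    sampled mini-batches, and broadcast the refined parameters back."""
    # 1. Average the parameters of the locally trained replicas.
    with torch.no_grad():
        for name, server_param in server_model.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in worker_models])
            server_param.copy_(stacked.mean(dim=0))
    # 2. Refine with mini-batches sampled from the WHOLE graph to recover
    #    the cross-partition structure the workers never saw.
    global_refine_step(server_model)
    # 3. Send the refined parameters back to every worker.
    for m in worker_models:
        m.load_state_dict(server_model.state_dict())
```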

        4 Frequently-used Data Caching. Frequently-used data caching refers to caching the information of frequently used vertices from other computing nodes, including neighbors or features, to reduce the overhead of data transmission.

        4 常用数据缓存。常用数据缓存是指缓存来自其他计算节点的常用顶点信息,包括邻居或特征,以减少数据传输的开销。

        To reduce the transmission overhead of obtaining the neighboring vertices in the sampling phase, AliGraph [35] proposes to cache the outgoing neighbors of important vertices wherever important vertices are located. Important vertices are identified by the following formula.

        为了减少在采样阶段获取相邻顶点的传输开销,AliGraph [35] 建议在重要顶点所在的任何地方缓存重要顶点的传出邻居。重要的顶点由以下公式标识。
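
        The formula is again not reproduced in this post; a hedged reconstruction from the symbol definitions below (and AliGraph's description of importance as the ratio of k-hop in-neighbors to out-neighbors) is:

```latex
\mathrm{Imp}_k(v) \;=\; \frac{N^{k}_{\mathrm{in}}(v)}{N^{k}_{\mathrm{out}}(v)},
\qquad v \text{ is cached if } \mathrm{Imp}_k(v) > U_k
```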

        where Imp_k(v) denotes the importance score of vertex v, N_in^k(v) and N_out^k(v) represent the numbers of k-hop incoming and outgoing neighbors of vertex v, and U_k is a user-specified threshold. When Imp_k(v) exceeds U_k, vertex v is considered to be an important vertex. By caching the outgoing neighbors of an important vertex v, the transmission overhead derived from the queries of other vertices via v can be reduced.

        其中 Imp_k(v) 表示顶点 v 的重要性得分,N_in^k(v) 和 N_out^k(v) 表示顶点 v 的 k 跳传入和传出邻居数,U_k 是用户指定的阈值。当 Imp_k(v) 超过 U_k 时,顶点 v 被认为是一个重要的顶点。通过缓存重要顶点 v 的传出邻居,可以减少通过 v 查询其他顶点的传输开销。

        Similarly, the frequently-used data caching technique can also reduce the transmission of mini-batches between CPUs and GPUs. It is the third communication requirement of joint-sample-based execution which occurs when CPUs and GPUs are responsible for sampling and model computation respectively. Due to the benefit of mini-batch training, each GPU only needs to load one mini-batch of data per round of computation, and it is possible to load cached data into the free memory during model computation. Thus, by caching node features and combining them into mini-batches on GPUs, the amount of mini-batch transmission can be reduced.

        同样,常用的数据缓存技术也可以减少 CPU 和 GPU之间 mini-batch 的传输。这是基于联合样本执行的第三个通信需求,发生在 CPU 和 GPU 分别负责采样和模型计算时。由于 mini-batch 训练的好处,每个 GPU 每轮计算只需要加载一个 mini-batch 数据,并且可以在模型计算期间将缓存数据加载到空闲内存中。因此,通过缓存节点特征并将它们组合成 GPU 上的 mini-batch,可以减少mini-batch 的传输量。

        PaGraph [51] uses a very simple caching strategy. It sorts vertices according to their degree from largest to smallest and then caches them in order, where the degree is the number of incoming edges of a vertex. The higher the degree, the higher the cache priority. This is a simple but effective method: when a vertex has a larger degree, this vertex is more likely to be a neighbor of other target vertices. As for the proportion of cached vertices, it is determined by measuring the free memory capacity of GPUs during the pre-run. 2PGraph [52] proposes GNN Layer-aware Caching. It adopts a sampling method that uses a fixed order of target vertices, which allows it to schedule the vertices to train next. By pre-caching all of the target vertices' L-hop neighbors into GPU memory before the computation, the time overhead of data transmission can be almost completely eliminated. However, fixing the order of target vertices may compromise the model accuracy, so it proposes to periodically permute the order to reduce this effect.

        PaGraph [51] 使用了一种非常简单的缓存策略。它根据顶点的度数从大到小对顶点进行排序,然后按顺序缓存它们,其中度数是顶点的传入边数。度数越高,缓存优先级越高。这是一种简单但有效的方法:当一个顶点的度数越大时,该顶点就越有可能是其他目标顶点的邻居。至于缓存顶点的比例,它是通过在预运行期间测量GPU的空闲内存容量来确定的。 2PGraph [52]提出了 GNN 层感知缓存。它采用了一种使用固定顺序的目标顶点的采样方法,这使得它可以安排顶点进行下一次训练。通过在计算之前将所有目标顶点的 L-hop 邻居预先缓存到 GPU 内存中,几乎可以完全消除数据传输的时间开销。然而,固定目标顶点的顺序可能会损害模型的准确性,因此建议定期置换顺序以减少这种影响
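
        A minimal sketch of the degree-sorted caching selection follows (illustrative only; `free_gpu_bytes` is assumed to come from a pre-run measurement as the text describes): the highest in-degree vertices are cached first until the measured memory budget is exhausted.

```python
import torch

def select_cached_vertices(in_degrees, feature_bytes_per_vertex, free_gpu_bytes):
    """Pick the vertices whose features should be cached on the GPU."""
    budget = int(free_gpu_bytes // feature_bytes_per_vertex)  # how many features fit
    order = torch.argsort(in_degrees, descending=True)        # highest degree first
    return order[:budget]                                      # vertex ids to cache

# Usage sketch (names are assumptions):
#   cached = select_cached_vertices(deg, feat_dim * 4, free_bytes)
#   gpu_feature_cache = cpu_features[cached].to("cuda")
```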

VI. SOFTWARE FRAMEWORKS FOR DISTRIBUTED GNN TRAINING

        In this section, we introduce the software frameworks of distributed GNN training. Table VIII provides a summary of these frameworks and their supported attributes.

        在本节中,我们介绍了分布式 GNN 训练的软件框架。表 VIII 总结了这些框架及其支持的属性。

        Basic Software Frameworks. PyG [84] and DGL [85] are the two most popular software frameworks in the GNN community. PyG [84] is a geometric deep learning extension library for PyTorch that enables deep learning on irregularly structured data such as graphs. It supports both CPU and GPU computing, providing convenience for using GPUs to accelerate the computing process. Through its message-passing application programming interface (API), it is easy to express various GNN models, as neighbor aggregation is a kind of message propagation. DGL [85] is a framework specialized for deep learning models on graphs. It abstracts the computation of GNNs into a few user-configurable message passing primitives, thus helping users express GNNs more conveniently. It achieves good performance by exploring a wide range of parallelization strategies. It also supports both CPU and GPU computing.

        基本软件框架。 PyG [84] 和 DGL [85] 是 GNN 社区中最流行的两个软件框架。 PyG [84] 是 PyTorch 的几何深度学习扩展库,可以对图形等不规则结构数据进行深度学习。同时支持CPU和GPU计算,为使用GPU加速计算过程提供便利。通过消息传递应用程序编程接口(API),很容易表达各种 GNN 模型,因为邻居聚合是一种消息传播。 DGL [85] 是专门用于图形深度学习模型的框架。它将 GNN 的计算抽象为几个用户可配置的消息传递原语,从而帮助用户更方便地表达 GNN。它通过探索广泛的并行化策略来实现良好的性能。它还支持 CPU 和GPU 计算。

        Due to their open-source and easy-to-use nature, more and more software frameworks are built on top of them and propose optimizations for distributed GNN training, such as DistGNN [47], SAR [50], DistDGL [38], PaGraph [51], LLCG [53], DistDGLv2 [54], SALIENT [56], P3 [106], and so on. Next, we introduce the software frameworks dedicated to distributed full-batch training and distributed mini-batch training.

        由于其开源易于使用的特性,越来越多的软件框架构建在它们之上,并提出分布式 GNN 训练的优化,例如 DistGNN [47]、SAR [50]、DistDGL [38]、PaGraph [51]、LLCG [53]、DistDGLv2 [54]、SALIENT[56]、P3 [106] 等。接下来,我们介绍专用于分布式全批量训练和分布式小批量训练的软件框架。专用于分布式全批训练的软件框架。

        Software Frameworks Dedicated to Distributed Full-batch Training. The software frameworks dedicated to distributed full-batch training include NeuGraph [44], Roc [45], FlexGraph [37], MG-GCN [48], and Dorylus [49].

        专用于分布式全批次训练的软件框架包括 NeuGraph [44]、Roc [45]、FlexGraph [37]、MG-GCN [48] 和 Dorylus [49]。 

        NeuGraph [44] is a distributed GNN training software framework proposed in 2019, using a multi-GPU hardware platform. It is categorized as the dispatch-workload-based execution of distributed full-batch training. It proposes SAGA-NN, an abstract model for the programming of GNN operations which splits each layer of model computation into four stages: Scatter, ApplyEdge, Gather, and ApplyVertex. SAGA-NN is named after these stages. The Scatter stage means the vertices scatter their features to their outgoing edges, while the Gather stage means the vertices gather values from their incoming edges. Two user-defined functions are used in the ApplyEdge and ApplyVertex stages, for users to declare neural network computations on edges and vertices respectively. Through the abstraction of SAGA-NN, users can easily express various GNN models and execute them in a parallelized way. NeuGraph optimizes the training process based on this abstract model, using techniques including vertex-centric workload partition, transmission planning, and feature-dimension workload partition.

        NeuGraph [44] 是2019 年提出的分布式 GNN 训练软件框架,使用多 GPU硬件平台。它被归类为分布式全批训练的基于调度工作负载的执行。它提出了 SAGA-NN,一种用于 GNN 操作编程的抽象模型,它将每一层模型计算分为四个阶段:Scatter、ApplyEdge、Gather 和 ApplyVertex。SAGA-NN 就是以这些阶段命名的。 Scatter 阶段意味着顶点将它们的特征分散到它们的输出边,而 Gather 阶段意味着顶点从它们的输入边收集值。 ApplyEdge 阶段和ApplyVertex 阶段使用了两个用户定义的函数,供用户分别在边和顶点上声明神经网络计算。通过SAGA-NN的抽象,用户可以方便的表达各种GNN模型,并以并行的方式执行。 NeuGraph 基于这个抽象模型优化训练过程,使用包括以顶点为中心的工作负载分区、传输规划和特征维度工作负载分区的技术。 

        Roc [45] is a distributed multi-GPU software framework for fast GNN training and inference, proposed in 2020. It is categorized as the dispatch-workload-based execution of distributed full-batch training. Its optimization of distributed training mainly focuses on balanced workload generation and transmission planning. In terms of balanced workload generation, an online linear regression cost model is proposed to achieve efficient graph partition. The cost model is being tuned by collecting runtime data. According to this cost model, the training resources and time required for the subgraph can be estimated to guide the workload partition for workload balance. In terms of transmission planning, a recursive dynamic programming algorithm is introduced to find the global optimal solution to decide which part of the data should be cached in GPU memory for reuse.

        Roc [45] 是一种用于快速 GNN训练和推理分布式多 GPU 软件框架,于 2020 年提出。它被归类为分布式全批次训练的基于调度工作负载的执行。它对分布式训练的优化主要集中在均衡的工作负载生成上和传输规划。在平衡工作负载生成方面,提出了一种在线线性回归成本模型来实现高效的图分区。正在通过收集运行时数据来调整成本模型。根据该成本模型,可以估计子图所需的训练资源和时间,以指导工作负载划分以实现工作负载平衡。在传输规划方面,引入了递归动态规划算法来寻找全局最优解,以决定数据的哪一部分应该缓存在GPU 内存中以供重用。

        FlexGraph [37] is a distributed multi-CPU training software framework proposed in 2021. It is categorized as the preset-workload-based execution of distributed full-batch training. In order to express more kinds of GNNs, including GNNs for heterogeneous graphs, it proposes a novel programming abstraction, namely NAU. NAU splits the computation of one GNN layer into three stages, including the NeighborSelection, Aggregation, and Update stages, each with a user-defined function. Based on NAU, FlexGraph optimizes the training process using techniques including graph pre-partition and partial aggregation. In terms of graph pre-partition, it introduces a cost model to estimate the runtime overhead of the workload, so as to guide the workload partition for workload balance. In terms of partial aggregation, FlexGraph partially aggregates the features of vertices collocated at the same partition when possible. In addition, it overlaps partial aggregations and communication to reduce the transmission overhead.

        FlexGraph [37] 是 2021 年提出的分布式多 CPU 训练软件框架。它被归类为基于预设工作负载的分布式全批训练执行。为了表达更多种类的 GNN,包括用于异构图的GNN,它提出了一种新颖的编程抽象,即 NAU。 NAU 将一个 GNN 层的计算分为三个阶段,包括NeighborSelection、聚合和更新阶段,每个阶段都有一个用户定义的函数。基于 NAU,FlexGraph 使用图预分区部分聚合等技术优化训练过程。在图预划分方面,引入成本模型来估算工作负载的运行时开销,从而指导工作负载划分以实现工作负载平衡。在部分聚合方面,FlexGraph 在可能的情况下部分聚合了并置在同一分区的顶点的特征。此外,它重叠部分聚合和通信以减少传输开销。

        MG-GCN [48] is a distributed multi-GPU training software framework proposed in 2021. It is categorized as the preset-workload-based execution of distributed full-batch training. It focuses on the efficient parallelization of the sparse matrix-matrix multiplication (SpMM) kernel on multi-GPU hardware platforms. It uses a matrix partitioning method to distribute the raw data to multiple GPUs, and each GPU is responsible for completing the workload of its own local matrix. It involves the efficient reuse of memory buffers to reduce the memory footprint of training GNN models, and overlaps communication and computation to reduce communication overhead. Specifically, the memory buffer in the computing node is used to cache the data reused by forward propagation and backward propagation, thereby reducing data transmission. As for the communication and computation overlap, it uses two GPU streams for computation and communication, respectively.

        MG-GCN [48] 是 2021 年提出的分布式多 GPU 训练软件框架。它被归类为基于预设工作负载的分布式全批训练执行。它侧重于多 GPU 硬件平台上稀疏矩阵-矩阵乘法(SpMM)内核的高效并行化。它使用矩阵划分的方法将原始数据分发给多个 GPU,每个 GPU 负责完成自己本地矩阵的工作量。它涉及内存缓冲区的有效重用,以减少训练 GNN 模型的内存占用,并将通信与计算重叠以减少通信开销。具体来说,计算节点中的内存缓冲区用于缓存前向传播和反向传播重用的数据,从而减少数据传输。至于通信和计算重叠,它使用两个 GPU 流分别进行计算和通信。

        Dorylus [49] is a distributed multi-CPU training software framework proposed in 2021. It is categorized as the preset-workload-based execution of distributed full-batch training. Its main focus is on how to train GNNs at a low cost, so it adopts serverless computing. Serverless computing refers to "cloud function" threads, such as AWS Lambda and Google Cloud Functions, that can be used massively in parallel at an extremely low price. The hardware platform of Dorylus consists of CPUs and serverless threads. The CPUs mainly perform the Aggregation operation, while the serverless threads are used for the Combination operation, due to the more regular computation and simpler workload partition of the Combination operation. It adopts a fine-grained workload partition to adapt to the situation that the available hardware resources of serverless threads are quite limited. In addition, asynchronous training is used to make full use of computing resources and reduce stalls.

        Dorylus [49] 是 2021 年提出的分布式多 CPU 训练软件框架。它被归类为基于预设工作负载的分布式全批训练执行。它主要关注如何以低成本训练GNN,因此采用了 Serverless 计算。无服务器计算指的是"云函数"线程,例如 AWS Lambda 和 Google CloudFunctions,可以以极低的价格大规模并行使用。Dorylus 的硬件平台由 CPU 和 Serverless 线程组成。CPU主要执行聚合操作,而无服务器线程用于组合操作,因为组合操作中计算更规律,工作负载分区更简单。它采用细粒度的工作负载分区来适应无服务器线程的可用硬件资源非常有限的情况。此外,采用异步训练,充分利用计算资源,减少停滞。

        Software Frameworks Dedicated to Distributed Mini-batch Training. The software frameworks dedicated to distributed mini-batch training include AliGraph [35] and AGL [36].

        专用于分布式小批量训练的软件框架。专用于分布式小批量训练的软件框架包括 AliGraph [35] 和 AGL [36]。 

        AliGraph [35] is a distributed multi-CPU training framework proposed in 2019. It is categorized as the joint-sample-based execution of distributed mini-batch training. It supports not only GNNs for homogeneous graphs and static graphs, but also GNNs for heterogeneous graphs and dynamic graphs. In terms of storage, it adopts a graph partitioning method to store the graph data in a distributed manner. The structure and the features of the subgraph in each computing node are stored separately. In addition, two caches are added for the features of vertices and edges. Furthermore, it proposes a caching strategy to reduce the communication overhead between computing nodes, that is, each computing node caches the outgoing neighbors of frequently-used vertices.

        AliGraph [35] 是 2019 年提出的分布式多 CPU 训练框架。它被归类为分布式小批量训练的基于联合样本的执行。它不仅支持用于同构图静态图的 GNN,还支持用于异构图动态图的 GNN。在存储方面,它采用图分区的方式,以分布式的方式存储图数据。每个计算节点中的子图的结构和特征是单独存储的。此外,为顶点和边的特征添加了两个缓存。此外,它提出了一种缓存策略用于减少计算节点之间的通信开销,即每个计算节点缓存经常使用的顶点的传出邻居

        AGL [36] is a distributed multi-CPU training software framework proposed in 2020. It is categorized as the individual-sample-based execution of distributed mini-batch training. To speed up the sampling process, it introduces a distributed pipeline to generate k-hop neighborhoods in the spirit of message passing, which is implemented with MapReduce infrastructure [99]. In this way, in the sampling phase, mini-batch data can be rapidly generated by collecting the k-hop neighbors of the target vertices. In the training phase, the computing nodes are partitioned into workers and parameter servers. The workers perform the model computation on the mini-batches, while the parameter servers maintain the current version of the model parameters. It also uses commonly used optimization techniques for better efficiency, such as transmission pipelining.

        AGL [36] 是 2020 年提出的分布式多 CPU 训练软件框架。它被归类为分布式小批量训练的基于个体样本的执行。为了加快采样过程,它引入了一个分布式管道来本着消息传递的精神生成 k-hop 邻域,这是用 MapReduce 基础设施实现的 [99]。这样,在采样阶段,可以通过收集目标顶点的 k-hop 邻居来快速生成小批量数据。在训练阶段,计算节点被划分为工作节点和参数服务器。工作节点对小批量执行模型计算,而参数服务器维护模型参数的当前版本。它还使用常用的优化技术来提高效率,例如传输管道。

        Software Frameworks Dedicated to Both Distributed Full-batch Training and Mini-batch Training. There exists a software framework dedicated to both distributed full-batch training and distributed mini-batch training.

        专用于分布式全批量训练和小批量训练的软件框架。存在一个专用于分布式全批量训练和分布式小批量训练的软件框架。

        GraphTheta [55] is a distributed multi-CPU training software framework proposed in 2021. It supports three training methods: mini-batch, full-batch, and cluster-batch training. The cluster-batch training method was proposed by Chiang et al. [80] in 2019. It first partitions a large graph into a set of smaller clusters. Then, it generates a batch of data either based on one cluster or a combination of multiple clusters. Apparently, cluster-batch restricts the neighbors of a target vertex to only one cluster, which is equivalent to conducting a globalized convolution on a cluster of vertices. There is a parameter server in GraphTheta, which is responsible for managing multi-version model parameters. The workers obtain the model parameters from the parameter server and transfer the generated gradients back to the parameter server for the update of the model parameters. Multi-version parameter management makes it possible to train GNNs asynchronously as well as synchronously.

        GraphTheta [55] 是 2021 年提出的分布式多 CPU 训练软件框架。它支持三种训练方法:mini-batch、full-batch 和 cluster-batch 训练。Chiang [80] 于 2019 年提出了 cluster-batch 训练方法。它首先将一个大图划分为一组较小的簇。然后,它基于一个集群或多个集群的组合生成一批数据。显然,cluster-batch将一个目标顶点的邻居限制在一个簇中,相当于对一个顶点簇进行全局卷积。 GraphTheta 中有一个参数服务器,负责管理多版本模型参数。 worker从参数服务器获取模型参数,并将生成的梯度传回参数服务器,用于模型参数的更新。多版本参数管理使异步和同步训练 GNN 成为可能。

VII. HARDWARE PLATFORMS FOR DISTRIBUTED GNN TRAINING

        In this section, we introduce the hardware platforms for distributed GNN training. At present, it is mainly categorized into two types: multi-CPU hardware platform and multi-GPU hardware platform. A summary of the hardware platform for distributed GNN training is shown in Table IX. Next, we introduce them separately.

        在本节中,我们介绍了用于分布式 GNN 训练的硬件平台。目前主要分为两类:多CPU硬件平台多GPU硬件平台。用于分布式 GNN 训练的硬件平台总结如表 IX 所示。下面我们分别介绍一下。

A. Multi-CPU Hardware Platform

        The structure diagram of multi-CPU hardware platform is illustrated in Fig. 10 (a). It is mainly composed of multiple computing nodes. Each computing node has one or multiple CPUs. CPUs in each computing node communicate with each other through the network. The graph data is generally stored in each computing node in a distributed storage manner. Each computing node performs computations mainly on its own local data. When remote data is required, it queries other computing nodes through the network to obtain the required data.

        多CPU硬件平台结构图如图10(a)所示。它主要由多个计算节点组成。每个计算节点都有一个或多个CPU。每个计算节点中的CPU通过网络相互通信。图数据一般以分布式存储的方式存储在各个计算节点中。每个计算节点主要使用自己的本地数据进行计算。当需要远程数据时,通过网络查询其他计算节点,获取所需数据。

        The main advantage of multi-CPU hardware platform is the good scalability due to the large memory [38], [47]. By increasing the number of computing nodes, large graphs can be maintained completely in memory. This makes it possible to avoid data access to high-latency storage components, such as hard disks, thereby ensuring the timely supply of data. This characteristic makes it ideal for preset-workload-based execution of distributed full-batch training [37], [47] and joint-sample-based execution of distributed mini-batch training [35], [38], [55]. The graph is partitioned and distributed to different computing nodes for storage. Each computing node is responsible for the workload of its own subgraph. This results in computing nodes using mostly local data for computation, although additional communication is also involved. By further optimizing the communication overhead, multi-CPU platforms can achieve good scalability.

        多 CPU 硬件平台的主要优点是由于内存大而具有良好的可扩展性 [38],[47]。通过增加计算节点的数量,可以将大图完全保存在内存中。这样可以避免数据访问硬盘等高延迟存储组件,从而保证数据的及时供应。这一特性使其非常适合基于预设工作负载的分布式全批训练[37]、[47]和基于联合样本的分布式小批量训练[35]、[38]、[55]的执行。将图进行分区,分布到不同的计算节点进行存储。每个计算节点负责自己子图的工作量。这导致计算节点主要使用本地数据进行计算,尽管还涉及额外的通信。通过进一步优化通信开销,多CPU平台可以实现良好的可扩展性。

        The main disadvantage of multi-CPU hardware platform is the limited computing resource in each computing node. CPU is a latency-oriented architecture with limited computing resources [107]. This makes it less friendly to neural network operations that incur a lot of computation [108], [109]. This problem is even more pronounced when using distributed mini-batch training. In this case, CPUs perform both the sampling and model computation, leading to a shortage of computing resources.

        多CPU硬件平台的主要缺点是每个计算节点的计算资源有限。 CPU 是一种面向延迟的架构,计算资源有限 [107]。这使得它对需要大量计算的神经网络操作不太友好 [108],[109]。当使用分布式 mini-batch 训练时,这个问题更加明显。在这种情况下,CPU同时进行采样和模型计算,导致计算资源短缺

B. Multi-GPU Hardware Platform

        The structure diagram of multi-GPU hardware platform is illustrated in Fig. 10 (b). Graph data is generally stored in a distributed manner on multiple computing nodes. Each computing node consists of CPUs and GPUs. Computing nodes are connected through high-speed network interfaces. For example, GPUs communicate with each other over PCIe under the same computing node and over InfiniBand across different computing nodes. GPUs can also be interconnected through NVLink and NVSwitch for higher-speed inter-GPU communication.

        多GPU硬件平台结构图如图10(b)所示。图数据一般以分布式的方式存储在多个计算节点上。每个计算节点由 CPU 和 GPU 组成。计算节点通过高速网络接口连接。例如,GPU 在同一计算节点下通过 PCIe 相互通信,在不同计算节点之间通过 InfiniBand 相互通信。GPU 还可以通过 NVLink 和 NVSwitch 互连,以实现更高速的 GPU 间通信。
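        As a rough illustration of how such a platform is typically wired together in software, the sketch below starts one process per GPU and initializes PyTorch's NCCL backend, which routes traffic over NVLink/PCIe within a node and InfiniBand (or Ethernet) across nodes. The torchrun launch command mentioned in the comments is an assumed usage, not something prescribed by the survey.

```python
# A minimal sketch of tying the GPUs in Fig. 10 (b) together with PyTorch:
# one process per GPU, NCCL for intra/inter-node GPU communication.
# Assumed launch: torchrun --nproc_per_node=<gpus> --nnodes=<nodes> this_script.py
import os
import torch
import torch.distributed as dist

def init_multi_gpu():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)            # bind this process to one GPU
    dist.init_process_group(backend="nccl")      # NCCL picks NVLink/PCIe/IB transport
    return local_rank, dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    local_rank, rank, world = init_multi_gpu()
    # Every GPU contributes a tensor; all_reduce sums them across all GPUs/nodes.
    x = torch.ones(1, device=f"cuda:{local_rank}") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world}: {x.item()}")
```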

        The main advantage of multi-GPU hardware platform is the rich computing resources. Compared with CPUs, GPUs have more computing resources that can be used for large-scale parallel computing. This is because GPU is a throughput-oriented architecture which focuses on improving the throughput by using massive computing resources [107]–[109]. In view of the development of DNNs, both sparse matrix multiplication and dense matrix multiplication on GPUs have been fully developed, which facilitates the model computation of GNNs. Due to its limited memory capacity, the multi-GPU platform is less suitable for distributed full-batch training; otherwise, a large amount of data transmission is incurred. Currently, distributed mini-batch training is usually performed on multi-GPU hardware platforms. The most popular way to perform distributed mini-batch training on a multi-GPU hardware platform is that CPUs are responsible for the sampling phase and GPUs are responsible for the model computation. This is because the computational pattern of the sampling phase is irregular, and CPUs are more suitable for this irregular computational pattern than GPUs [88], [90].

        多GPU硬件平台的主要优势是丰富的计算资源。与CPU 相比,GPU 具有更多的计算资源,可用于大规模并行计算。这是因为 GPU 是一种面向吞吐量的架构,它专注于通过使用海量计算资源来提高吞吐量 [107]-[109]。鉴于DNN的发展,GPU中的稀疏矩阵乘法和稠密矩阵乘法都得到了充分的发展,这为GNN的模型计算提供了便利。由于其内存容量有限,多GPU平台不太适合分布式全批量训练,否则会导致大量的数据传输。目前,分布式小批量训练通常在多 GPU 硬件平台上进行。在多 GPU 硬件平台上进行分布式 mini-batch 训练最流行的方式是 CPU 负责采样阶段,GPU 负责模型计算。这是因为采样阶段的计算模式是不规则的,而 CPU 比 GPU 更适合这种不规则的计算模式 [88]、[90]。
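        A simplified sketch of this division of labor is given below: a CPU-side neighbor sampler produces mini-batches while the GPU runs the dense model computation. The sampler, the toy graph, and the two-layer model are illustrative stand-ins, not a particular framework's sampling API.

```python
# A simplified sketch of the common split on a multi-GPU platform:
# CPU threads do the irregular sampling work, the GPU does the regular
# dense model computation on the sampled batch.
import random
import torch
import torch.nn as nn

def cpu_sample(adj, train_ids, batch_size, fanout):
    """Runs on the CPU: pick target vertices and a fixed-size neighbor sample."""
    targets = random.sample(train_ids, batch_size)
    sampled = set(targets)
    for v in targets:
        neigh = adj.get(v, [])
        sampled.update(random.sample(neigh, min(fanout, len(neigh))))
    return sorted(sampled)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).to(device)

# Toy graph: 100 vertices with random neighbor lists and random features.
adj = {v: random.sample(range(100), 5) for v in range(100)}
features = torch.randn(100, 16)  # stays in host memory until a batch is sampled

for step in range(3):
    batch = cpu_sample(adj, list(range(100)), batch_size=8, fanout=3)  # CPU phase
    x = features[batch].to(device, non_blocking=True)                  # host-to-GPU copy
    out = model(x)                                                     # GPU phase
    print(step, out.shape)
```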

        The main disadvantages of multi-GPU hardware platform are resource contention issues and expensive data transmission overhead. Since there are generally multiple GPUs on a computing node, these GPUs share the computing node's resources, including CPU sampling threads and network interface [44], [83]. The competition for these resources makes it hard for multiple GPUs to obtain sufficient input data in time. This causes computational stagnation, resulting in poor scalability, manifested by a large gap between the actual speedup and the ideal one [83]. The expensive data transmission overhead is due to the tight memory resources of GPUs. When using a multi-GPU hardware platform for distributed full-batch training, it needs to continuously transmit input data and output data [44]. When performing distributed mini-batch training, it is also necessary to continuously transmit the mini-batch data sampled by CPUs [51]. This problem is further exacerbated when the graph is large.

        多GPU硬件平台的主要缺点是资源争用问题和昂贵的数据传输开销。由于计算节点上通常有多个GPU,这些 GPU 共享计算节点的资源,包括 CPU 采样线程和网络接口 [44],[83]。对这些资源的竞争使得多个GPU 难以及时获得足够的输入数据,从而造成计算停滞,导致可扩展性差,表现为实际加速比与理想加速比之间的巨大差距[83]。昂贵的数据传输开销是由于 GPU 的内存资源紧张。当使用多 GPU 硬件平台进行分布式全批量训练时,需要不断传输输入数据和输出数据 [44]。在进行分布式 mini-batch 训练时,还需要不断传输 CPU 采样的 mini-batch 数据 [51]。当图很大时,这个问题会进一步加剧。

C. Multi-CPU Hardware Platform V.S. Multi-GPU Hardware Platform

        The summary of the comparison between multi-CPU hardware platform and multi-GPU hardware platform is shown in Table IX. Multi-CPU and multi-GPU hardware platforms have their own advantages and are suitable for different scenarios. The different characteristics of the two also lead to different optimization directions. Multi-CPU hardware platform is more suitable for distributed full-batch training. The entire graph can be cached in the memory, which leads to good scalability. Multi-GPU hardware platform is more suitable for distributed mini-batch training given its limited memory resources. However, when the graph size is small and there are high-speed interconnections between GPUs, such as NVLink, distributed full-batch training using multi-GPU hardware platform can achieve good performance [45].

        表 IX 显示了多 CPU 硬件平台和多 GPU 硬件平台之间的比较总结。多CPU和多GPU硬件平台各有优势,适用于不同的场景。两者不同的特点也导致了不同的优化方向。多CPU硬件平台更适合分布式full-batch训练。整个图可以缓存在内存中,这带来了良好的可扩展性。由于内存资源有限,多 GPU 硬件平台更适合分布式mini-batch 训练。然而,当图尺寸较小且 GPU 之间存在高速互连(例如 NVLink)时,使用多 GPU 硬件平台的分布式全批次训练可以获得良好的性能 [45]。

VIII. COMPARISON TO DISTRIBUTED DNN TRAINING

        In this section, we highlight the characteristics of distributed computing of GNNs by comparing the distributed training of GNNs with that of DNNs. We first introduce the two categories of distributed DNN training: data parallelism and model parallelism. Among them, GNN distributed full-batch training operates in a manner similar to the model parallelism of DNNs, while GNN distributed mini-batch training is similar to the data parallelism of DNNs. The two pairs are compared separately below. A summary of the comparison between distributed GNN training and distributed DNN training is shown in Table X. Next, we introduce them separately.

        在本节中,我们通过比较GNN 和 DNN 的分布式训练来强调 GNN 分布式计算的特点。我们首先介绍分布式 DNN 训练的两大类:数据并行模型并行。其中,对于GNN分布式full-batch训练,其运行方式类似于DNNs的模型并行。 GNN 分布式小批量训练类似于 DNN 的数据并行性。下面分别比较这两对。分布式 GNN 训练和分布式 DNN 训练之间的比较总结如表 X所示。接下来,我们分别介绍它们。

 A. Brief Introduction to Distributed DNN Training

        Distributed DNN training is categorized into model parallelism and data parallelism [110]. In model parallelism, the model is split into several parts and different computing nodes compute each part. AlexNet [111] is the first one to use model parallelism, because its model size could not fit in one GPU at that time. Therefore, the model is split into two parts, which are then computed by two GPUs together.

        分布式 DNN 训练分为模型并行性和数据并行性 [110]。在模型并行中,模型被分成几个部分,不同的计算节点计算每个部分。 AlexNet [111] 是第一个使用模型并行性的,因为当时它的模型大小无法容纳在一个 GPU 中。因此,模型被分成两部分,然后由两个 GPU 一起计算。
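        As a minimal sketch of this style of model parallelism (assuming two GPUs and arbitrary layer sizes; it is not AlexNet's actual split), the model below places its two halves on different devices and ships the intermediate activation between them:

```python
# A minimal sketch of DNN model parallelism: the model is split into two halves
# on two GPUs, and activations are transferred between them in the forward pass.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Sequential(nn.Linear(2048, 10)).to("cuda:1")

    def forward(self, x):
        h = self.part0(x.to("cuda:0"))   # computed on GPU 0
        h = h.to("cuda:1")               # intermediate activation sent to GPU 1
        return self.part1(h)             # computed on GPU 1

if torch.cuda.device_count() >= 2:
    model = TwoGPUModel()
    y = model(torch.randn(32, 1024))
    print(y.shape)  # torch.Size([32, 10]), resident on cuda:1
```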

        In data parallelism, each computing node holds a copy of the model parameters, and the input dataset is distributed to each computing node. Each computing node conducts the training process completely using its own local data and uses the generated gradients to update the model parameters with other computing nodes together. Data parallelism is classified into synchronous training and asynchronous training. However, compared with synchronous distributed training, asynchronous distributed training has lower final accuracy, and sometimes non-convergence may occur [110]. In synchronous distributed training, after each computing node completes one round of training on a small piece of data, the system starts to collect gradients, which are used to update the model parameters uniformly. In this way, each round of training in each computing node is carried out with the same model parameters, which is equivalent to the case of one computing node.

        在数据并行中,每个计算节点都持有模型参数的副本,输入数据集是分布到每个计算节点。每个计算节点完全使用自己的本地数据进行训练过程,并使用生成的梯度与其他计算节点一起更新模型参数。数据并行分为同步训练异步训练。但是,与同步分布式训练相比,异步分布式训练的最终精度较低,有时会出现不收敛的情况[110]。在同步分布式训练中,每个计算节点在一小块数据上完成一轮训练后,系统开始收集梯度,用于统一更新模型参数。这样,每个计算节点的每一轮训练都是用相同的模型参数进行的,相当于一个计算节点的情况。
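        The synchronous variant described above can be sketched as follows, assuming the process group has already been initialized (as in the earlier multi-GPU sketch). In practice, PyTorch's DistributedDataParallel wraps this gradient synchronization automatically; the manual all_reduce below just makes the pattern explicit.

```python
# A minimal sketch of synchronous data parallelism: every process holds a model
# replica, computes gradients on its own data shard, and the gradients are
# averaged with all_reduce before the identical update on every replica.
import torch.distributed as dist
import torch.nn as nn

def train_step(model: nn.Module, optimizer, x, y, world_size: int):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    for p in model.parameters():                       # synchronize gradients
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size                           # average over replicas
    optimizer.step()                                   # same update everywhere
    return loss.item()
```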

B. Distributed Full-batch Training V.S. DNN Model Parallelism

        Similarities. Both of them distribute the workload of model computation to different computing nodes from the perspective of the model. Each computing node needs to transfer the intermediate results of model computation to each other and cooperate to complete a round of training. In DNN model parallelism, DNN model computation is partitioned into different parts and distributed to different computing nodes [111]. Each computing node may be responsible for executing several layers of DNN, or a set of operations within each layer. Similarly, GNN distributed full-batch training also distributes the workload of model computation to different computing nodes from the perspective of the model [40], [44], [45], [48]. Since the operation of GNN model computation is to operate on vertices, including the Aggregation and Combination operations, the workload is generally distributed to different computing nodes by partitioning the graph. Each computing node is responsible for completing the Aggregation and Combination operation of its assigned vertices and communicating with other computing nodes to jointly advance the computation [37], [46], [47].

        相似之处。两者都是从模型的角度将模型计算的工作量分配给不同的计算节点。每个计算节点需要将模型计算的中间结果传递给彼此,并协同完成一轮训练。在 DNN 模型并行性中,DNN 模型计算被分成不同的部分并分布到不同的计算节点 [111]。每个计算节点可能负责执行 DNN的几层,或每层内的一组操作。同样,GNN 分布式全批训练也从模型的角度将模型计算的工作负载分配到不同的计算节点 [40]、[44]、[45]、[48]。由于GNN模型计算的操作是对顶点进行操作,包括Aggregation和Combination操作,因此,工作负载通常通过对图进行分区来分配到不同的计算节点。每个计算节点负责完成其分配的顶点的聚合和组合操作,并与其他计算节点通信以共同推进计算[37]、[46]、[47]。
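        A deliberately simplified sketch of one GNN layer under this scheme is shown below: each process owns a block of vertex rows and, for brevity, gathers all remote features with all_gather before aggregating with its local block of the adjacency matrix. Real systems [37], [46], [47] exchange only the boundary vertices; the equal-split assumption and the function shape here are ours.

```python
# A simplified sketch of one GNN layer under distributed full-batch training.
# Assumes the process group is initialized and every rank holds an equal-sized
# block of vertex rows; real systems only exchange boundary-vertex features.
import torch
import torch.distributed as dist

def full_batch_layer(local_adj, local_feat, weight):
    """local_adj: (n_local, n_total) sparse block; local_feat: (n_local, f) dense."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local_feat) for _ in range(world_size)]
    dist.all_gather(gathered, local_feat)              # fetch remote vertex features
    all_feat = torch.cat(gathered, dim=0)              # (n_total, f)
    aggregated = torch.sparse.mm(local_adj, all_feat)  # Aggregation for owned vertices
    return torch.relu(aggregated @ weight)             # Combination
```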

        Differences. DNN model parallelism has low data communication uncertainty, while GNN distributed full-batch training exhibits high data communication uncertainty. In DNN model parallelism, most of the transmitted data is intermediate data generated in the computation. Since the transmission of intermediate data directly affects the execution of other computing nodes, there is a strong execution dependence among computing nodes. Due to the high certainty of data transmission in DNN model parallelism, which is reflected in the fixed data volume and transmission target, the computation and data transmission can be overlapped by reasonable planning to reduce the communication overhead. However, in GNN distributed full-batch training, the data communication uncertainty is high. This is because the data transmitted between each computing node is mainly the feature of the neighboring vertex needed in the Aggregation step. Due to the irregular connection pattern of the graph itself and the fluctuation of the computational efficiency of each computing node, there are great uncertainties in data transmission [46]. Therefore, the congestion of the communication network can easily occur, which in turn leads to stagnation of computation [112].

        差异。 DNN 模型并行性具有低数据通信不确定性,而 GNN 分布式全批训练表现出高数据通信不确定性。在 DNN 模型并行中,大部分传输的数据是计算中产生的中间数据。由于中间数据的传输直接影响其他计算节点的执行,因此计算节点之间存在很强的执行依赖性。由于DNN模型并行性数据传输的确定性高,体现在固定的数据量和传输目标,通过合理规划可以实现计算和数据传输的重叠,减少通信开销。然而,在 GNN 分布式全批次训练中,数据通信的不确定性很高。这是因为每个计算节点之间传输的数据主要是聚合步骤中需要的相邻顶点的特征。由于图本身的不规则连接模式和每个计算节点计算效率的波动,数据传输存在很大的不确定性[46]。因此,通信网络很容易发生拥塞,进而导致计算停滞[112]。

C. Distributed Mini-batch Training V.S. DNN Data Parallelism

        Similarities. Both of them use small pieces of input data per round of training to update the model parameters. The computing nodes use different small pieces of input data for computation, and then jointly update the model parameters. In DNN data parallelism, each computing node holds a portion of the original data, typically stored in its memory. In each round of computation, a small piece of data is taken from its own local data and computed. Then, using the generated gradients, the model parameters are updated asynchronously or synchronously. In GNN distributed mini-batch training, the training of each computing node also uses independent input data, that is, the mini-batch generated by sampling [51], [52].

        相似之处。他们都在每轮训练中使用小块输入数据来更新模型参数。计算节点使用不同的小块输入数据计算,然后联合更新模型参数。在 DNN 数据并行中,每个计算节点都持有一部分原始数据,通常存储在其内存中。在每一轮计算中,都会从自己的本地数据中取出一小块数据进行计算。然后,使用生成的梯度,异步或同步更新模型参数。在 GNN 分布式 mini-batch 训练中,每个计算节点的训练也使用独立的输入数据,即采样生成的mini-batch [51],[52]。

        Differences. The input data required for each round of GNN distributed mini-batch training needs to be generated by sampling, while in DNN data parallelism it is loaded directly from memory. In DNN data parallelism, each computing node performs computations on its own local data [110]. It does not need to communicate with other computing nodes except for the stage of updating model parameters. In GNN distributed mini-batch training, a large amount of communication is required in the sampling phase, including querying graph data and transmitting mini-batch data. As graphs generally exhibit an irregular connection pattern, the communication in the sampling phase is highly irregular, resulting in high communication uncertainty and planning complexity. As a result, distributed GNN training may not be able to provide input data in time due to the inefficiency of the sampling, resulting in computational stagnation. This makes it necessary to optimize the sampling stage to reduce the communication overhead when conducting GNN distributed mini-batch training [51], [52].

        差异。每轮GNN分布式mini-batch训练所需的输入数据需要通过采样生成,而DNN数据并行是直接从内存中加载。在 DNN 数据并行中,每个计算节点对自己的本地数据执行计算 [110]。除更新模型参数阶段外,不需要与其他计算节点通信。在GNN 分布式 mini-batch 训练中,采样阶段需要大量的通信需求,包括查询图数据传输 mini-batch 数据。由于图通常表现出不规则的连接模式,采样阶段的通信高度不规则,导致通信不确定性和规划复杂性高。这导致分布式GNN 训练可能由于采样效率低下而无法及时提供输入数据,从而导致计算停滞。这使得在进行 GNN 分布式小批量训练 [51]、[52] 时有必要优化采样阶段以减少通信开销。

IX. SUMMARY AND DISCUSSION

        This section summarizes the aforementioned details of distributed GNN training, and discusses several interesting issues and opportunities in this field.

        本节总结了上述分布式 GNN 训练的细节,并讨论了该领域中几个有趣的问题和机会。

        The main focus of optimizing distributed full-batch training is to reduce its communication overhead [37], [40], [44]–[47], [49]. In this training method, each computing node inevitably needs to communicate with other nodes for information about the entire graph structure, so that it can finish operations which require graph data that are not stored in the local split, such as aggregating information of neighboring nodes which are distributed in other computing components. As a result, reducing communication overhead, mainly by reasonably splitting the graph and planning the data transmission, can greatly improve computational efficiency.

        优化分布式全批次训练的主要重点是减少其通信开销[37]、[40]、[44]-[47]、[49]。在这种训练方法中,每个计算节点不可避免地需要与其他节点通信以获取整个图结构的信息,以便完成那些需要未存储在本地分区中的图数据的操作,例如聚合分布在其他计算组件中的相邻节点的信息。因此,主要通过合理拆分图和规划数据传输来减少通信开销,可以大大提高计算效率。

        By contrast, the main focuses of optimizing distributed mini-batch training are to speed up the sampling phase and to reduce the mini-batch transmission overhead [35], [36], [38], [51], [52], [56]. Currently, it is popular to perform distributed mini-batch training in which CPUs are responsible for sampling and GPUs are responsible for GNN model computation. The sampling capability of CPUs is insufficient, producing a lot of idleness and limiting the utilization of GPUs due to insufficient data supply. By accelerating sampling and optimizing data transmission, the utilization of computing components can be improved, thereby improving efficiency.

        相比之下,优化分布式小批量训练的主要重点是加快采样阶段并减少小批量传输开销 [35]、[36]、[38]、[51]、[52]、[56]。目前比较流行的是分布式mini-batch训练,CPU负责采样,GPU负责GNN模型计算。CPU 的采样能力不足,产生大量空闲,并因数据供应不足而限制 GPU 的利用率。通过加速采样和优化数据传输,可以提高计算组件的利用率,从而提高效率。

        The two methods show their advantages in different aspects respectively. In current designs of distributed training of GNNs, the adoption of the mini-batch method is gradually increasing due to its advantages of a simpler implementation and less memory pressure. Distributed full-batch training, on the other hand, requires programmers to put more effort into programming and design, so it is more difficult to optimize. However, some research shows that the final precision of distributed full-batch training can be higher [45].

        这两种方法分别在不同方面表现出各自的优势。在当前的 GNN 分布式训练设计中,mini-batch 方法由于实现更简单、内存压力更小等优点,其采用正逐渐增加。而分布式 full-batch 训练需要程序员在编程和设计上投入更多的精力,优化难度更大。然而,一些研究表明,分布式全批量训练的最终精度可以更高 [45]。

        Next, we discuss several interesting issues and opportunities in this field.

        接下来,我们将讨论该领域中几个有趣的问题和机遇。

A. Quantitative Analysis of Performance Bottleneck

        As mentioned before, the performance of distributed GNN training is primarily hindered by the workload imbalance across different computing nodes, irregular transmission between computing nodes, etc. Although these performance bottlenecks are well-known in the GNN community, the quantitative analysis of these bottlenecks, which is important for guiding the optimizations of distributed GNN training, remains obscure.

        如前所述,分布式 GNN 训练的性能主要受到不同计算节点之间的工作负载不平衡、计算节点之间的不规则传输等因素的阻碍。虽然这些性能瓶颈在 GNN 社区中是众所周知的,但对这些瓶颈的定量分析仍然不清楚,而这种分析对于指导分布式 GNN 训练的优化非常重要。

        Recently, a lot of characterization efforts for GNN training or inference have been conducted on a single computing node or a single GPU [113]–[118]. For example, Yan et al. [113] quantitatively disclose the computational pattern of the inference on a single GPU. Zhang et al. [114] quantitatively characterize the training of a large portion of GNN variants concerning general-purpose and application-specific hardware architectures. However, there are few characterization efforts on distributed GNN training [116].

        最近,在单个计算节点或单个 GPU [113]-[118] 上进行了大量 GNN 训练或推理的表征工作。例如,Yan 等人 [113] 定量地公开了单个 GPU 上推理的计算模式。张等 [114] 定量描述了关于通用和特定应用硬件架构的大部分 GNN 变体的训练。然而,分布式 GNN 训练的表征工作很少 [116]。

        Diagnosing performance issues without a quantitative understanding of the bottlenecks can cause misdiagnosis: A potentially expensive effort that involves optimizing something that is not the real problem. Although Lin et al. [116] quantitatively characterize the end-to-end execution of distributed GNN training, it is not enough for the long-term development of distributed GNN training. We believe that the quantitative analysis of performance bottleneck will be one of the main directions in the field of distributed GNN training.

        在没有对瓶颈进行定量了解的情况下诊断性能问题可能会导致误诊:花费高昂的代价去优化并非真正问题所在的部分。虽然 Lin 等人 [116] 定量表征了分布式 GNN 训练的端到端执行,但这对于分布式 GNN 训练的长期发展还不够。我们相信性能瓶颈的定量分析将是分布式 GNN 训练领域的主要方向之一。
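        As a minimal illustration of what such a quantitative breakdown could look like, the sketch below times the sampling, transmission, and computation phases of each training step and reports their shares of the total; the three phase functions are placeholders for a real training loop.

```python
# A minimal sketch of a per-phase time breakdown for distributed GNN training:
# wrap each phase of a step with timers and report where the time actually goes.
import time
from collections import defaultdict

def profile_steps(sample_fn, transfer_fn, compute_fn, num_steps=100):
    totals = defaultdict(float)
    for _ in range(num_steps):
        for name, fn in (("sampling", sample_fn),
                         ("transmission", transfer_fn),
                         ("computation", compute_fn)):
            t0 = time.perf_counter()
            fn()
            totals[name] += time.perf_counter() - t0
    total = sum(totals.values())
    return {name: f"{100 * t / total:.1f}%" for name, t in totals.items()}

# Toy usage with dummy phases standing in for a real training step.
print(profile_steps(lambda: time.sleep(0.002),
                    lambda: time.sleep(0.001),
                    lambda: time.sleep(0.003),
                    num_steps=10))
```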

B. Performance Benchmark

        Despite the large quantity of performance benchmarks presented to fairly evaluate and compare the performance of software frameworks and hardware platforms for DNNs, there are few performance benchmarks for distributed GNN training. Most of the recent benchmarks for GNNs mainly focus on evaluating the prediction accuracy of trained GNN models. For example, Open Graph Benchmark (OGB) [39] provides a set of large-scale real-world benchmark datasets to facilitate the graph machine learning research. However, current benchmarks focus mainly on the accuracy, instead of performance. As a result, standardized performance benchmark suites with typical, representative, and widely-acknowledged workloads are needed for the industry and academia to comprehensively compare the performance of distributed GNN training systems.

        尽管已经提出了大量性能基准来公平地评估和比较 DNN 的软件框架和硬件平台的性能,但分布式 GNN 训练的性能基准却很少。大多数最近的 GNN 基准测试主要集中在评估经过训练的 GNN 模型的预测准确性。例如,Open Graph Benchmark (OGB) [39] 提供了一组大规模的真实世界基准数据集,以促进图机器学习研究。然而,当前的基准主要关注准确性,而不是性能。因此,业界和学术界需要具有典型、代表性和广泛认可的工作负载的标准化性能基准套件,以全面比较分布式 GNN 训练系统的性能。

        In the field of DNNs, MLperf [119], an industry standard benchmark suite for machine learning performance, has made a great contribution to the development of software frameworks and systems. Therefore, we believe that performance benchmark suites for GNNs similar to MLperf are vital to drive the rapid development of GNNs.

        在 DNN 领域,机器学习性能的行业标准基准套件 MLperf[119] 为软件框架和系统的开发做出了巨大贡献。因此,我们认为类似于 MLperf 的 GNN 性能基准套件对于推动 GNN 的快速发展至关重要。

C. Distributed Training on Extreme-scale Hardware Platform

        For distributed GNN training on large-scale graphs or even extreme-scale graphs, the performance can be greatly improved by using the extreme-scale hardware platform. The scale of the graph data has reached a staggering magnitude and is growing at an exaggerated rate. For example, Sogou [120] graph dataset has 12 trillion edges and Kronecker2 [121] graph dataset has 70 trillion edges. The training on graphs in this order of magnitude puts a very high requirement on the computing resources and memory resources of the computing system. Only by using the extreme-scale hardware platform can the training task be completed in a reasonable time. In this respect, the research of graph processing and distributed DNN training on the extreme-scale hardware platform can be used for reference. ShenTu [120] realizes processing graph with trillion edges in seconds by using petascale computing system. It suggests that the extreme-scale hardware platform can be used to accelerate distributed GNN training, which needs to explore and establish more hardware platform support for distributed GNN training.

        对于大规模图甚至超大规模图的分布式GNN训练,使用超大规模硬件平台可以大幅提升性能。图数据的规模已经达到了惊人的数量级,并且正在以惊人的速度增长。例如,搜狗 [120] 图数据集有 12 万亿条边,Kronecker2[121] 图数据集有 70 万亿条边。这种数量级的图训练对计算系统的计算资源和内存资源提出了非常高的要求。只有使用超大规模的硬件平台,才能在合理的时间内完成训练任务。在这方面,极端规模硬件平台上的图处理和分布式DNN训练的研究可以借鉴。申图[120]利用千万亿级计算系统,实现了秒级处理具有万亿条边的图。这表明可以利用超大规模硬件平台来加速分布式GNN训练,这需要探索和建立更多支持分布式GNN训练的硬件平台。

D. Domain-specific Distributed Hardware Platform

        There are many single-node domain-specific hardware platforms designed for GNNs. However, very few efforts are conducted on distributed hardware platforms. Recently, many single-node domain-specific hardware accelerators designed for GNNs have achieved significant improvements in performance and efficiency compared with the single-GPU platform. For example, HyGCN [86] proposes a hybrid architecture consisting of an Aggregation engine and a Combination engine for the acceleration of the Aggregation operation and Combination operation, respectively. Compared with the NVIDIA V100 GPU, HyGCN achieves an average 6.5× speedup with 10× energy reduction. However, the single-node platform is not sufficient to handle the rapid development of GNNs since the ever-growing scale of graphs dramatically increases the training time, generating the demand for distributed GNN hardware platforms. In response to that, MultiGCN [112] proposes a distributed hardware platform for the acceleration of GNN inference, which consists of multiple computing nodes to perform the inference in parallel. MultiGCN achieves a 2.5∼8× speedup over the state-of-the-art multi-GPU software framework. However, a single work is not enough. We have witnessed that TPU-Pod [122], a domain-specific supercomputer for training DNNs, greatly accelerates the deployment of large-scale DNN models. Therefore, we believe that it is time to present a domain-specific distributed hardware platform for GNNs similar to TPU-Pod, which is vital to drive the rapid development and deployment of GNNs.

        有许多为 GNN 设计的单节点特定领域硬件平台。然而,在分布式硬件平台上进行的努力很少。最近,与单GPU 平台相比,许多为 GNN 设计的单节点特定领域硬件加速器在性能和效率方面取得了显著提高。例如,HyGCN [86] 提出了一种由聚合引擎和组合引擎组成的混合架构,分别用于加速聚合操作和组合操作。与 NVIDIA V100 GPU 相比,HyGCN 实现了平均 6.5 倍的加速和 10 倍的能耗降低。然而,单节点平台不足以应对 GNN 的快速发展,因为不断增长的图规模极大地增加了训练时间,从而产生了对分布式 GNN 硬件平台的需求。作为回应,MultiGCN [112] 提出了一个用于加速 GNN 推理的分布式硬件平台,该平台由多个计算节点组成以并行执行推理。MultiGCN 比最先进的多 GPU 软件框架实现了 2.5∼8 倍的加速。但仅有一项工作是不够的。我们已经见证了 TPU-Pod [122],一种用于训练 DNN 的特定领域超级计算机,极大地加速了大规模 DNN 模型的部署。因此,我们认为是时候提出一个类似于 TPU-Pod 的面向 GNN 的特定领域分布式硬件平台了,这对于推动 GNN 的快速开发和部署至关重要。

E. General Communication Library for Distributed GNN Training

        The variety of hardware platforms and irregular communication patterns of distributed GNN training lead to strong demand for a general communication library to ease the programming effort on transmission planning while achieving high efficiency. The communication characteristics of hardware platforms are various due to the differences in the number of computing nodes, network topology, and communication bandwidth as well as communication latency of interconnection interface. In order to better adapt to the various transmission requirements of distributed GNN training on various hardware platforms, it is urgent to develop a general communication library which adapts to different network topologies, as well as various interconnection interfaces of different communication bandwidths and latency, so as to ease the implementation of novel communication patterns while ensuring efficiency.

        分布式 GNN 训练的硬件平台多样性和不规则的通信模式,导致对通用通信库的强烈需求,以在实现高效率的同时减轻传输规划的编程工作。硬件平台的通信特性因计算节点数量、网络拓扑结构、通信带宽以及互连接口的通信时延等方面的差异而各不相同。为了更好地适应分布式GNN训练在各种硬件平台上的各种传输需求,迫切需要开发一种通用通信库,使其能够适应不同的网络拓扑,以及不同通信带宽和时延的各种互连接口,从而在确保效率的同时简化新型通信模式的实现。

X. CONCLUSION

        In this paper, we comprehensively review distributed training of GNNs. After investigating recent efforts on distributed GNN training, we classify them in more detail according to their workflows, including the dispatch-workload-based execution as well as the preset-workload-based execution of distributed full-batch training, and the individual-sample-based execution as well as the joint-sample-based execution of distributed mini-batch training. Each taxonomy's workflow, computational pattern, and communication pattern are summarized, and various optimization techniques are further introduced to facilitate the understanding of recent research status, and to enable readers to quickly build up a big picture of distributed training of GNNs. In addition, this paper introduces the training software frameworks and hardware platforms, and then contrasts distributed GNN training with distributed DNN training. In the end, we provide several discussions about interesting issues and opportunities of distributed GNN training.

        在本文中,我们全面回顾了 GNN 的分布式训练。在调查了最近在分布式 GNN 训练方面的努力之后,我们根据其工作流程对它们进行了更详细的分类,包括分布式全批量训练的基于调度工作负载的执行和基于预设工作负载的执行,以及分布式小批量训练的基于单独采样的执行和基于联合采样的执行。我们总结了每个分类的工作流程、计算模式和通信模式,并进一步介绍了各种优化技术,以促进对最新研究状况的理解,并使读者能够快速构建 GNN 分布式训练的全貌。此外,本文介绍了训练的软件框架和硬件平台,然后对比了分布式GNN训练和分布式DNN训练。最后,我们提供了几个关于分布式 GNN 训练的有趣问题和机会的讨论。

        The emergence of distributed GNN training has successfully expanded the usage of GNNs on large-scale graphs, making it a powerful tool when learning from real-world associative relations. We optimistically look forward to new research which can further optimize this powerful tool on large-scale graph data and bring its application into the next level.

        分布式 GNN 训练的出现成功地扩展了 GNN 在大规模图上的使用,使其成为从现实世界的关联关系中学习的强大工具。我们乐观地期待新的研究能够进一步优化这一强大的大规模图数据工具,并将其应用推向新的高度。

ACKNOWLEDGMENTS

        This work was supported by the National Natural Science Foundation of China (Grant No. 61732018, 61872335, and 62202451), Austrian-Chinese Cooperative R&D Project (FFG and CAS) (Grant No. 171111KYSB20200002), CAS Project for Young Scientists in Basic Research (Grant No. YSBR029), and CAS Project for Youth Innovation Promotion Association.

        这项工作得到了国家自然科学基金(批准号 61732018、61872335 和 62202451)、中奥合作研发项目(FFG 和CAS)(批准号 171111KYSB20200002)、中国科学院基础研究青年科学家项目的支持(批准号 YSBR-029)和CAS 青年创新促进协会项目。
