Notes on GNN-related papers at OSDI '21
Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads
Key points
- Serverless computing
- computation separation
Background
- While GPUs offer great efficiency for training, they (and their host machines) are expensive to use.
- GPUs have limited memory, hindering scalability.
- Existing frameworks remain unable to handle the billion-edge graphs that are commonplace today.
- Serverless computing provides large numbers of parallel "cloud function" threads, or Lambdas, at an extremely low price.
Challenges
- Serverless threads were built to execute light asynchronous tasks.
- Limited compute resources: how to make computation fit into Lambda's weak compute profile?
- Restricted network resources: how to minimize the negative impact of Lambda's network latency?
Contributions
- Devise a low-cost training framework for GNNs on billion-edge graphs.
- Through computation separation, Dorylus makes it possible, for the first time, for Lambdas to provide a scalable, efficient, and low-cost distributed computing scheme for GNN training.
Technical approach
- Take advantage of serverless computing to increase scalability at a low cost.
- Divide a training pipeline into a set of fine-grained tasks based on the type of data they process (enabled by computation separation).
- Employ a novel parallel computation model, bounded pipeline asynchronous computation (BPAC). BPAC makes full use of pipelining, overlapping different fine-grained tasks with each other to avoid communication bottlenecks during training.
- Use asynchrony in a novel way at two distinct locations where staleness can be tolerated: parameter updates (in the tensor-parallel path) and data gathering from neighbor vertices (in the graph-parallel path).
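The "bounded" part of BPAC means workers may proceed with stale data only up to a fixed staleness limit. A minimal sketch of that gating idea (class and method names are illustrative, not from the paper):

```python
import threading

class BoundedStaleness:
    """Minimal sketch of bounded asynchrony in the spirit of BPAC:
    a worker may run ahead with stale state only while its logical
    clock is within `bound` steps of the slowest worker."""

    def __init__(self, num_workers, bound):
        self.bound = bound
        self.clocks = [0] * num_workers
        self.cv = threading.Condition()

    def tick(self, worker_id):
        """Advance this worker's clock; block if it is too far ahead."""
        with self.cv:
            self.clocks[worker_id] += 1
            self.cv.notify_all()
            # Wait until the gap to the slowest worker is within the bound.
            while self.clocks[worker_id] - min(self.clocks) > self.bound:
                self.cv.wait()
```

A fast worker calling `tick` simply blocks until the stragglers catch up, which is what keeps parameter staleness bounded while still allowing pipeline overlap.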
Open-source link
https://github.com/uclasystem/dorylus
GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
Key points
- Leverage input-level information to guide system-level optimizations.
- Optimizations tailored to GNNs.
- Enable automatic runtime optimization.
Background
- Existing solutions fall short in three major aspects:
  - Failing to leverage GNN input information.
  - Optimizations not tailored to GNNs.
  - Poor runtime support for input adaptability.
Challenges
- Inter-node workload imbalance and redundant atomic operations.
Contributions
- The first to explore GNN input properties (e.g., GNN model architectures and input graphs) and give an in-depth analysis of their importance in guiding system optimizations for GPU-based GNN computing.
- Propose a set of GNN-tailored system optimizations with parameterization, including a novel 2D workload management (§4) and specialized memory customization (§5) on GPUs.
- Incorporate analytical modeling and parameter auto-selection (§6) to ease design space exploration.
- Comprehensive experiments demonstrate the strength of GNNAdvisor over state-of-the-art GNN execution frameworks.
Technical approach
- 2D workload management
  - Coarse-grained neighbor partitioning.
  - Fine-grained dimension partitioning.
  - Warp-based thread alignment.
- GNN-specialized memory optimizations
  - Community-aware node renumbering.
  - Warp-aware memory customization.
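To make the workload-balancing idea concrete, here is a sketch of coarse-grained neighbor partitioning on a CSR graph: each node's neighbor list is split into fixed-size groups so that every group is a roughly equal unit of work (on the GPU, one such group would map to a warp). The function name and `group_size` parameter are illustrative, not GNNAdvisor's actual API:

```python
def neighbor_partition(indptr, indices, group_size):
    """Split each node's neighbor list (CSR layout: indptr/indices)
    into fixed-size neighbor groups for balanced per-warp workloads.
    Returns (node, start, end) half-open ranges into `indices`."""
    groups = []
    for node in range(len(indptr) - 1):
        start, end = indptr[node], indptr[node + 1]
        # High-degree nodes yield many groups; low-degree nodes few,
        # so no single thread group is stuck with a giant neighbor list.
        for g in range(start, end, group_size):
            groups.append((node, g, min(g + group_size, end)))
    return groups
```

The `group_size` knob is exactly the kind of parameter the paper's analytical model would auto-select per input graph.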
Open-source link
https://github.com/YukeWang96/OSDI21_AE
Marius: Learning Massive Graph Embeddings on a Single Machine
Key points
- Training of graph embeddings.
- The key to scalable training of graph embeddings is optimized data movement.
Background
- Current systems for learning the embeddings of large-scale graphs are bottlenecked by data movement, which results in poor resource utilization and inefficient training.
- Graph embedding models aim to capture the global structure of a graph and are complementary to graph neural networks (GNNs).
- Learning a graph embedding model is a resource-intensive process:
  - It can be compute-intensive.
  - It can be memory-intensive.
  - It requires optimizing over loss functions that treat the edges of the graph as training examples.
- Scaling graph embedding training to instances that do not fit in GPU memory introduces costly data movement overheads, which can result in poor resource utilization and slow training.
Challenges
- Training of large graph embedding models.
Contributions
- Show that existing state-of-the-art graph embedding systems are hindered by IO inefficiencies when moving data from disk and from CPU to GPU.
- Introduce the Buffer-aware Edge Traversal Algorithm (BETA), an algorithm that generates an IO-minimizing data ordering for graph learning.
- Combine the BETA ordering with a partition buffer and async IO via pipelining to introduce the first graph learning system that utilizes the full memory hierarchy (disk-CPU-GPU).
Technical approach
- Proposed a pipelined architecture that leverages partition caching and buffer-aware data orderings to minimize disk access, and interleaves data movement with computation to maximize utilization.
- Introduce asynchronous training of nodes with bounded staleness.
- Design an in-memory partition buffer that can hide and reduce IO from swapping of partitions.
- Introduce a buffer-aware ordering that uses knowledge of the buffer size and its contents to minimize the number of IOs.
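The core of a buffer-aware ordering is: visit every pair of node partitions (each pair covers the edges between those partitions), preferring pairs whose partitions are already resident in the bounded in-memory buffer, so partition swaps from disk are minimized. Below is a simplified greedy stand-in for that idea, not the actual BETA algorithm; all names are illustrative:

```python
def buffer_aware_order(num_partitions, buffer_size):
    """Greedily order all (src, dst) partition pairs so that pairs
    fully covered by the current buffer are processed first; count
    partition loads (a proxy for disk IO)."""
    pairs = {(i, j) for i in range(num_partitions)
                    for j in range(num_partitions)}
    buffer, order, loads = [], [], 0
    while pairs:
        # Prefer a pair whose partitions are all already buffered.
        ready = sorted(p for p in pairs if set(p) <= set(buffer))
        if ready:
            pair = ready[0]
        else:
            pair = min(pairs)  # deterministic fallback: must swap
            for part in pair:
                if part not in buffer:
                    if len(buffer) >= buffer_size:
                        buffer.pop(0)  # evict oldest resident partition
                    buffer.append(part)
                    loads += 1  # one disk read
        order.append(pair)
        pairs.remove(pair)
    return order, loads
```

When the buffer can hold both partitions of every pending pair, each partition is read exactly once; the real BETA ordering provides this kind of IO guarantee across the whole traversal.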
Open-source link
https://github.com/marius-team/marius/tree/osdi2021
P3: Distributed Deep Graph Learning at Scale
Key points
- Scaling GNN model training to large real-world graphs in a distributed setting.
- Eliminate the overheads of partitioning and reduce network communication.
Background
- Scalability is a fundamental issue in training GNNs.
- Network communication accounts for a major fraction of training time.
Challenges
- Enabling GNN training in a distributed fashion:
  - Communication bottlenecks due to data dependencies.
  - Ineffectiveness of partitioning.
  - GPU underutilization.
Contributions
- Observe the shortcomings of applying distributed graph processing techniques to scale GNN model training.
- A new approach that relies only on random hash partitioning of the graph and of the features, done independently, thus effectively eliminating the overheads of partitioning.
- Propose a novel hybrid-parallelism execution strategy that combines intra-layer model parallelism with data parallelism, significantly reducing network communication and creating many opportunities for pipelining compute and communication.
- Scales to large graphs gracefully and achieves significant performance benefits.
Technical approach
- Practically eliminates the need for any intelligent partitioning of the graph, proposing to partition the input graph and the features independently.
- Completely avoids communicating (typically huge) features over the network by adopting a novel pipelined push-pull execution strategy that combines intra-layer model parallelism and data parallelism, and further reduces overheads using a simple caching mechanism.
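The "independent partitioning" idea can be sketched in a few lines: vertices go to machines by a hash of the vertex id, while the feature matrix is sliced along the feature dimension, with no attempt to co-locate a vertex with its features. This sketch uses modulo as a stand-in for a random hash; function and variable names are our own, not from the paper:

```python
def hash_partition(num_nodes, feature_dim, num_machines):
    """P3-style independent partitioning sketch: graph vertices are
    hash-partitioned across machines, and feature columns are split
    evenly across machines, independently of the vertex placement."""
    # Vertex -> owning machine, by hash of the vertex id.
    vertex_owner = {v: v % num_machines for v in range(num_nodes)}
    # Each machine holds a contiguous slice of feature columns.
    cols = (feature_dim + num_machines - 1) // num_machines  # ceil division
    feature_slices = {
        m: range(m * cols, min((m + 1) * cols, feature_dim))
        for m in range(num_machines)
    }
    return vertex_owner, feature_slices
```

Because every machine holds a slice of every vertex's features, the first GNN layer can be computed with intra-layer model parallelism and only small partial activations (never full feature vectors) cross the network.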