Notes: Survey on GNN Acceleration


Survey on Graph Neural Network Acceleration: An Algorithmic Perspective
Xin Liu, Mingyu Yan, Lei Deng, Guoqi Li, Xiaochun Ye, Dongrui Fan, Shirui Pan and Yuan Xie

1 Graph-level Optimizations

Graph Sampling

Graph sampling methods generally select a subset of nodes, either from a node's neighborhood or from one layer of the graph, to construct subgraphs for subsequent training, which makes the computation of model training efficient. A minimal node-wise sampling sketch follows the list below.

  • Node-wise sampling methods

    • Node-wise sampling is a fundamental sampling method that focuses on each node and its neighbors in a training graph
    • Existing works
      • GraphSAGE - randomly samples a fixed-size set of neighbors for each node, typically within 2 hops
      • VR-GCN - restricts the number of neighbors
  • Layer-wise sampling methods

    • Layer-wise sampling samples a fixed number of nodes in each layer based on pre-computed probability
    • Existing works
      • FastGCN - samples a certain number of nodes per layer and reconstructs connections between two layers
      • AS-GCN - samples nodes according to the parent nodes sampled in the upper layer
  • Subgraph-based sampling methods

    • Subgraph-based sampling generates subgraphs for training in a two-step manner: sampling nodes and constructing connections (edges)
    • Moreover, subgraph sampling can be parallelized at the processing unit level
    • Existing works
      • GraphSAINT - utilizes a node sampler, an edge sampler, and a random walk sampler to construct subgraphs
  • Heterogeneous sampling methods

    • Heterogeneous sampling is designed to accelerate the training and handle the graph heterogeneity, i.e., imbalanced number and type of neighbors in a node’s neighborhood
    • Existing works
      • HetGNN - collects neighbors via random walks, groups them by type, and further samples them based on visit frequency
      • HGT - samples each node type in turn so that the sampled neighborhood has a balanced distribution over types
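
As a concrete illustration of node-wise sampling, the sketch below performs GraphSAGE-style uniform neighbor sampling on a toy adjacency dict. The function name sample_neighborhood, the dict-based graph format, and the fanout values are illustrative assumptions, not part of the survey or of any library API.

```python
import random

def sample_neighborhood(adj, seed_nodes, fanouts, rng=random):
    """Node-wise (GraphSAGE-style) sampling sketch: for each hop, keep at most
    fanouts[hop] randomly chosen neighbors of the current frontier."""
    layers = [sorted(seed_nodes)]          # nodes needed at each hop
    frontier = set(seed_nodes)
    for fanout in fanouts:                 # e.g. fanouts = [3, 2] for 2 hops
        next_frontier = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            if len(neighbors) > fanout:    # sub-sample large neighborhoods
                neighbors = rng.sample(neighbors, fanout)
            next_frontier.update(neighbors)
        layers.append(sorted(next_frontier))
        frontier = next_frontier
    return layers

# Toy usage: 2-hop sampling around node 0 on a hand-made adjacency dict.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 5], 3: [0], 4: [0, 5], 5: [2, 4]}
print(sample_neighborhood(adj, [0], fanouts=[3, 2]))
```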

Graph Sparsification

Graph sparsification methods typically remove task-irrelevant edges from a graph according to a specific optimization goal. A DropEdge-style sketch follows the list below.

  • Heuristic sparsification

    • DropEdge - randomly removes edges in each training epoch
    • FastGAT - presents a resistance-based spectral graph sparsification solution to remove useless edges, reducing the number of attention coefficients in a GAT model
  • Learnable module for sparsification

    • NeuralSparse - utilizes a DNN to learn a sparsification strategy from the feedback of downstream tasks during training, thereby casting graph sparsification as an optimization problem
    • GAUG - utilizes an edge predictor to acquire probabilities of edges in a graph
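
A minimal sketch of the DropEdge idea, assuming edges are stored as a (2, E) NumPy COO array; the helper drop_edge and the toy edge list are hypothetical, and real implementations operate on the framework's own graph and edge structures.

```python
import numpy as np

def drop_edge(edge_index, drop_rate, rng=None):
    """DropEdge-style heuristic sparsification sketch: before each training
    epoch, keep every edge independently with probability 1 - drop_rate."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(edge_index.shape[1]) >= drop_rate
    return edge_index[:, keep]

# Toy usage: edge_index is a (2, E) COO array of (src, dst) pairs.
edges = np.array([[0, 0, 1, 2, 3],
                  [1, 2, 2, 3, 0]])
print(drop_edge(edges, drop_rate=0.3))
```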

Graph Partition

Graph partition is an NP-hard problem whose goal is to divide the original graph into smaller subgraphs built from mutually exclusive groups of nodes. A toy greedy partition sketch follows the list below.

  • Graph partition for generic GNN training

    • DistGNN - applies a vertex-cut graph partition method, which keeps replicas of the nodes that are split apart from their neighbors
    • DistDGL - partitions a graph using METIS, and further colocates node/edge features with the graph partitions
  • Graph partition for sampling-based GNN training

    • BGL - utilizes a graph partition method to minimize cross-partition communication in sampling subgraphs during GNN training, in which multisource BFS and greedy assignment heuristics are fully leveraged
    • Cluster-GCN - partitions a graph into multiple clusters using METIS, after which these clusters are randomly sampled to construct subgraphs
    • PaGraph - proposes a GNN-aware graph partition method to ensure workload-balanced partitions and avoid cross-partition visits during sampling
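
The partitioners above (METIS, DistGNN's vertex-cut scheme, BGL's multi-source BFS) are far more sophisticated; the sketch below only illustrates the shared objective, i.e., balanced partitions with few cross-partition edges, via a simple greedy streaming heuristic. greedy_partition and the toy graph are hypothetical, not any system's actual algorithm.

```python
from collections import defaultdict

def greedy_partition(adj, num_parts):
    """Toy streaming partition heuristic (not METIS): place each node in the
    partition that already holds most of its neighbors, under a size cap,
    so cross-partition edges stay roughly minimized and sizes stay balanced."""
    cap = (len(adj) + num_parts - 1) // num_parts   # balanced size limit
    assignment, sizes = {}, defaultdict(int)
    for node in sorted(adj):                        # stream nodes in fixed order
        scores = defaultdict(int)
        for nbr in adj[node]:
            if nbr in assignment:
                scores[assignment[nbr]] += 1        # prefer parts with neighbors
        candidates = [p for p in range(num_parts) if sizes[p] < cap]
        part = max(candidates, key=lambda p: (scores[p], -sizes[p]))
        assignment[node] = part
        sizes[part] += 1
    return assignment

# Toy usage on a 6-node graph split into 2 partitions.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [3, 4]}
print(greedy_partition(adj, num_parts=2))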

2 Model-level Optimizations

GNN Simplification

The forward propagation in the $l$-th layer of GCN training can be formulated as
$$\textbf{H}^l=\sigma(\tilde{\textbf{D}}^{-1/2}\tilde{\textbf{A}}\tilde{\textbf{D}}^{-1/2}\textbf{H}^{l-1}\textbf{W}^{l-1}),$$
in which linear aggregation of neighboring information and a nonlinear activation are combined to update node representations. However, recent literature has argued that the efficiency of GNNs can be further improved by simplifying redundant operations.
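
For reference, a dense NumPy sketch of this forward rule; it mirrors the formula above (self-loops added to form the normalized adjacency, then linear transform and ReLU) but is not an efficient or official implementation, and the toy graph and shapes are made up.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H_out = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    written with dense arrays purely for readability."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees of A_tilde
    S = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(S @ H @ W, 0.0)                # sigma = ReLU here

# Toy usage: 4 nodes, 3-dim input features, 2-dim output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(gcn_layer(A, H, W).shape)   # (4, 2)
```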

  • Generic simplified GNN

    • SGC - removes the nonlinear activation, i.e., ReLU, between each layer, with only the final Softmax function preserved, so the whole model collapses to (see the pre-computation sketch after this list):
      $$Output=\text{Softmax}(\textbf{S}\cdots\textbf{S}\textbf{S}\textbf{X}\textbf{W}^1\textbf{W}^2\cdots\textbf{W}^l)$$
      • Text and graph classification tasks
  • Special simplified GNN with tasks related

    • LightGCN and UltraGCN aim to simplify the GNN model for learning embeddings from user-item interaction graphs in a recommender system
    • Existing works
      • LightGCN - abandons feature transformation and nonlinear activation, and drops the self-loops in $\textbf{A}$, to simplify the GCN model used in collaborative filtering (CF)
      • UltraGCN - proposes to abandon explicit message passing among multiple layers in LightGCN and get approximate ultimate embeddings directly
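
A sketch of the SGC-style simplification, assuming the same symmetric normalization as in the GCN formula above: because the nonlinearities between layers are gone, the repeated propagation can be pre-computed once, after which training is just a softmax regression on the smoothed features. The function and variable names are illustrative.

```python
import numpy as np

def sgc_precompute(A, X, k):
    """SGC-style pre-computation sketch: with the nonlinearities removed,
    S^k X is computed once up front, so training reduces to a softmax
    (logistic) regression on the smoothed features."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    S = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # normalized adjacency
    for _ in range(k):                                       # k rounds of smoothing
        X = S @ X
    return X        # feed into Softmax(X W) with a single collapsed weight W

# Toy usage: pre-smooth features once, reuse them for every training epoch.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = np.random.randn(3, 4)
print(sgc_precompute(A, X, k=2).shape)   # (3, 4)
```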

GNN Compression

Model compression is a technique that compresses a complicated model, such as a DNN, into a lightweight one that typically preserves far fewer parameters.

  • Model quantization

    • Binarized DGCNN and BiGCN - both introduce binarization strategies into GNNs, compressing model parameters and graph features to yield significant acceleration in inference
    • Degree-Quant - proposes a quantization-aware training method for GNNs
  • Knowledge distillation

    • Knowledge distillation (KD) is a technique that extracts knowledge from a teacher model and injects it into a smaller student model that maintains similar performance, which in turn accelerates model inference (a generic soft-target KD loss is sketched after this list)
    • Existing works
      • Yang et al. (2020) - propose to preserve the local structure of a teacher model via a specially designed module
      • Yang et al. (2021) - present an effective KD framework in which a specially built student model jointly benefits from a teacher model and prior knowledge
      • TinyGNN - bridges the gap in extracting neighbor information between a teacher model and a student model via the combined use of a peer-aware module and a neighbor distillation strategy
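
To make the KD idea concrete, here is a generic soft-target distillation loss in PyTorch; it is not the specific objective of Yang et al. or TinyGNN, only the standard teacher-student formulation (temperature-scaled KL divergence plus cross-entropy) that such GNN KD methods build on. The hyperparameter values and toy tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target KD loss sketch: the student matches the teacher's
    temperature-softened predictions while still fitting ground-truth labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage: 5 nodes, 3 classes; teacher logits would come from a trained GNN.
student = torch.randn(5, 3, requires_grad=True)
teacher = torch.randn(5, 3)
labels = torch.randint(0, 3, (5,))
print(distillation_loss(student, teacher, labels))
```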