CS224W: Machine Learning with Graphs
Stanford / Winter 2021
17-scalable
-
Consider training a GNN with standard SGD (node classification as an example)
- When computing the loss, randomly sample some nodes and backpropagate on them
- This treats the sampled nodes as isolated and independent of each other
- But nodes in a GNN depend on each other (they are not independent), so this way of computing the loss does not train GNNs effectively
-
Consider full-batch training
- Not enough GPU memory to load the entire graph
GraphSAGE Neighbor Sampling: Scaling up GNNs
-
Stochastic Training of GNNs
-
Key Insight: To compute the embedding of a single node, all we need is its K-hop neighborhood
-
Given a set of M different nodes in a mini-batch, we can generate their embeddings using M computational graphs
-
This way, we only need to load the nodes inside the mini-batch's computational graphs, which saves GPU memory
-
However, we still have to gather the entire K-hop neighborhood and run aggregation and transformation over it, which is computationally expensive
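To make this concrete, here is a minimal sketch (plain Python over a hypothetical adjacency-list `adj`) of collecting the K-hop neighborhood that one node's computational graph needs; a mini-batch of M nodes needs the union of these sets:

```python
from collections import deque

def k_hop_neighborhood(adj, node, K):
    """Collect all nodes within K hops of `node` by BFS.

    adj: hypothetical adjacency-list format, dict node -> list of neighbors.
    Returns the set of nodes a K-layer GNN must load to compute
    the embedding of `node`.
    """
    visited = {node}
    frontier = deque([(node, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == K:
            continue
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                frontier.append((v, depth + 1))
    return visited

def mini_batch_loaded_nodes(adj, batch, K):
    """A mini-batch of M nodes only needs the union of their K-hop neighborhoods."""
    needed = set()
    for node in batch:
        needed |= k_hop_neighborhood(adj, node, K)
    return needed
```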
-
-
Neighbor Sampling
-
For $k = 1, 2, \dots, K$:
- For each node in the $k$-th hop of the neighborhood, (randomly) sample at most $H_k$ neighbors
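A minimal sketch of this sampling loop, again over a hypothetical adjacency-list `adj`, where `H[k-1]` plays the role of the per-hop budget $H_k$:

```python
import random

def sample_computational_graph(adj, root, H):
    """Neighbor sampling for one node.

    adj: hypothetical adjacency-list format, dict node -> list of neighbors.
    H:   per-hop budgets [H_1, ..., H_K]; at hop k we keep at most H[k-1]
         randomly chosen neighbors of each node on the current frontier.
    Returns a dict mapping each visited node to the (pruned) neighbor list
    actually used when aggregating its messages.
    """
    sampled_edges = {}
    frontier = [root]
    for k in range(len(H)):                    # hops k = 1, ..., K
        next_frontier = []
        for u in frontier:
            neighbors = adj[u]
            kept = random.sample(neighbors, min(H[k], len(neighbors)))
            sampled_edges.setdefault(u, []).extend(kept)
            next_frontier.extend(kept)
        frontier = next_frontier
    return sampled_edges
```

In practice, graph libraries implement this for you; for example, PyTorch Geometric's `NeighborLoader` takes a `num_neighbors` list that plays the role of the $H_k$ budgets.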
-
-
-
How to sample the nodes
-
Random Sampling
- Fast, but often not optimal (may sample many "unimportant" nodes)
-
Random Walk with Restarts
-
Natural graphs are "scale-free": sampling random neighbors yields many low-degree "leaf" nodes
-
Strategy to sample important nodes
-
Compute the Random Walk with Restarts score $R_i$, starting the walks from the node whose embedding we are computing (the green node in the lecture figure)
-
At each level, sample the $H$ neighbors $i$ with the highest $R_i$
-
-
This strategy works much better in practice
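A minimal sketch of this strategy, assuming a hypothetical adjacency-list graph: estimate $R_i$ by Monte Carlo random walks with restarts, then keep the $H$ highest-scoring neighbors:

```python
import random
from collections import Counter

def rwr_scores(adj, start, num_walks=200, walk_len=10, restart_p=0.5):
    """Estimate Random Walk with Restarts visit counts R_i from `start`.

    adj: hypothetical adjacency-list format, dict node -> list of neighbors.
    """
    counts = Counter()
    for _ in range(num_walks):
        u = start
        for _ in range(walk_len):
            if random.random() < restart_p or not adj[u]:
                u = start                      # restart at the starting node
            else:
                u = random.choice(adj[u])      # step to a random neighbor
            counts[u] += 1
    return counts

def important_neighbors(adj, node, H, scores):
    """Keep the H neighbors i of `node` with the highest R_i."""
    return sorted(adj[node], key=lambda i: scores[i], reverse=True)[:H]
```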
-
-
Subgraph Sampling
-
Issues with Neighbor Sampling
-
The size of the computational graph grows exponentially with the number of GNN layers
-
Computation is redundant, especially when nodes in a mini-batch share many neighbors
-
-
Key Idea
- We can sample a small subgraph of the large graph and then perform efficient layer-wise node embedding updates over the subgraph
-
Subgraphs should retain edge connectivity structure of the original graph as much as possible
-
This way, the GNN over the subgraph generates embeddings closer to the GNN over the original graph
Cluster-GCN
-
Key Insight
- Sample a community as a subgraph. Each subgraph retains the essential local connectivity pattern of the original graph
-
Steps
-
Pre-processing
Given a large graph, partition it into groups of nodes (i.e., node-induced subgraphs)
-
We can use any scalable community detection methods, e.g., Louvain, METIS
-
Notice: between-group edges are not included in $G_1, \dots, G_C$
-
-
Mini-batch training
Sample one node group at a time. Apply GNN’s message passing over the induced subgraph
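A minimal sketch of the mini-batch loop, assuming the node groups have already been produced in pre-processing (e.g., by Louvain or METIS) and with `train_step` as a hypothetical placeholder for the GNN forward/backward pass:

```python
import random

def induced_subgraph(adj, group):
    """Node-induced subgraph of `group`: between-group edges are dropped."""
    group = set(group)
    return {u: [v for v in adj[u] if v in group] for u in group}

def cluster_gcn_training(adj, groups, train_step, num_steps):
    """Vanilla Cluster-GCN mini-batch loop.

    adj:        hypothetical adjacency-list format, dict node -> neighbors
    groups:     node groups produced in pre-processing by a scalable
                community detection method (e.g. Louvain, METIS)
    train_step: hypothetical callback that runs the GNN's message passing
                over the induced subgraph and performs one SGD update
    """
    for _ in range(num_steps):
        group = random.choice(groups)            # sample one node group
        sub_adj = induced_subgraph(adj, group)   # message passing stays inside it
        train_step(sub_adj, group)
```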
-
-
Issues with Cluster-GCN
-
The induced subgraph removes between-group links
- As a result, messages from other groups will be lost during message passing, which could hurt the GNN’s performance
-
Graph community detection algorithm puts similar nodes together in the same group
- A sampled node group tends to cover only a small, concentrated portion of the entire data
-
Sampled nodes are not diverse enough to be representative of the entire graph structure
-
As a result, the gradient averaged over the sampled nodes becomes unreliable
-
It fluctuates a lot from one node group to another
-
In other words, the gradient has high variance
-
-
Advanced Cluster-GCN
-
Key Idea
-
Aggregate multiple node groups per mini-batch
-
Partition the graph into relatively-small groups of nodes
-
For each mini-batch
-
Sample and aggregate multiple node groups
-
Construct the induced subgraph of the aggregated node group
-
The rest is the same as vanilla Cluster-GCN
-
-
-
Steps
-
Pre-processing
- Each partition needs to be small, so that even when multiple of them are aggregated, the resulting group is not too large
-
Mini-batch training
-
Randomly sample a set of $q$ node groups
-
Aggregate all nodes across the sampled node groups
-
Extract the induced subgraph (multiple groups together form one induced subgraph, so edges between the sampled clusters are preserved and the gradient is more stable)
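A minimal sketch of the advanced mini-batch construction under the same hypothetical adjacency-list format as the Cluster-GCN sketch above:

```python
import random

def advanced_cluster_batch(adj, groups, q):
    """Build one mini-batch for advanced Cluster-GCN.

    Randomly samples q node groups, aggregates their nodes, and extracts the
    induced subgraph of the union.  Edges *between* the sampled groups are
    kept, so the mini-batch gradient fluctuates less than with a single group.
    """
    sampled = random.sample(groups, q)           # q randomly chosen groups
    batch_nodes = set().union(*sampled)          # aggregate all of their nodes
    return {u: [v for v in adj[u] if v in batch_nodes] for u in batch_nodes}
```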
-
-
Simplifying GNNs
-
Simplify GCN by removing ReLU non-linearity
$\boldsymbol{H}^{(k+1)}=\widetilde{\boldsymbol{A}} \boldsymbol{H}^{(k)} \boldsymbol{W}_{k}^{\mathrm{T}}$
-
Removing ReLU significantly simplifies GCN!
$\boldsymbol{H}^{(K)}=\widetilde{\boldsymbol{A}}^{K} \boldsymbol{X} \boldsymbol{W}^{\mathrm{T}}$
- $\widetilde{\boldsymbol{A}}^{K} \boldsymbol{X}$ does not contain any learnable parameters; hence, it can be pre-computed
-
Let $\widetilde{\boldsymbol{X}}=\widetilde{\boldsymbol{A}}^{K} \boldsymbol{X}$ be the pre-computed matrix
-
Simplified GCN’s final embedding is
$\boldsymbol{H}^{(K)}=\widetilde{\boldsymbol{X}} \boldsymbol{W}^{\mathrm{T}}$
- It's just a linear transformation of the pre-computed matrix
-
Back to the node embedding form
$h_{v}^{(K)}=\boldsymbol{W} \widetilde{\boldsymbol{X}}_{v}$
- The embedding of node $v$ only depends on its own (pre-processed) feature
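A minimal NumPy sketch of the pre-computation, assuming a dense adjacency matrix `A` and feature matrix `X`, with $\widetilde{\boldsymbol{A}}$ taken to be the self-loop-augmented, symmetrically normalized adjacency commonly used by GCN:

```python
import numpy as np

def precompute_features(A, X, K):
    """Pre-compute X_tilde = A_tilde^K X (no learnable parameters involved).

    A: dense adjacency matrix (n x n), X: node feature matrix (n x d).
    """
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    X_tilde = X
    for _ in range(K):                           # K rounds of neighbor averaging
        X_tilde = A_tilde @ X_tilde
    return X_tilde

# The model is then just H = X_tilde @ W.T: a linear layer over the
# pre-computed features, trainable with ordinary mini-batch SGD since each
# row of X_tilde depends only on that node's pre-processed feature.
```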
-
-
Potential Issue of Simplified GCN
- Compared to the original GNN models, simplified GCN’s expressive power is limited due to the lack of non-linearity in generating node embeddings
-
Performance of Simplified GCN
- Surprisingly, on semi-supervised node classification benchmarks, simplified GCN works comparably to the original GNNs despite being less expressive.
-
When does Simplified GCN Work?
- Many node classification tasks exhibit homophily structure, i.e., nodes connected by edges tend to share the same target labels (since simplified GCN's pre-computation repeatedly aggregates information from a node's neighbors, it still performs reasonably well on node classification: neighboring nodes tend to share the same label, and this neighbor aggregation is exactly the mechanism by which label information propagates)