CS224W: Machine Learning with Graphs - 07 Graph Neural Networks (GNN) 2: Design Space

Design Space

0. A General GNN Framework

1). Message
2). Aggregation

GNN layer = Message + Aggregation
Different instantiations under this perspective (GCN, GraphSAGE, GAT, …)

3). Layer Connectivity

connect GNN layers into a GNN

  • Stack layers sequentially
  • Ways of adding skip connections
4). Graph Manipulation

Idea: raw input graph ≠ computational graph

  • Graph feature augmentation
  • Graph structure manipulation
5). Learning Objective

How do we train a GNN?

  • Supervised/unsupervised objectives
  • Node/edge/graph level objectives

1. A Single GNN Layer

Idea: compress a set of vectors into a single vector
Two-step process: Message and Aggregation

1). Message Computation

Message function: $m_u^l = \text{MSG}^l(h_u^{l-1})$
Intuition: each node will create a message, which will be sent to other nodes later
Example: a linear layer $m_u^l = W^l h_u^{l-1}$

2). Message Aggregation

Intuition: each node $v$ will aggregate the messages from its neighbors
$h_v^l = \text{AGG}^l(\{m_u^l, u \in N(v)\})$
Example: sum, mean, or max aggregator
Issue: information from node $v$ itself could get lost (the computation of $h_v^l$ does not directly depend on $h_v^{l-1}$)
Solution: include $h_v^{l-1}$ when computing $h_v^l$

  • Message: compute a message from node $v$ itself
    Perform a different message computation $m_v^l = B^l h_v^{l-1}$
  • Aggregation: after aggregating from neighbors, we can aggregate the message from node $v$ itself via concatenation or summation
    $h_v^l = \text{CONCAT}(\text{AGG}^l(\{m_u^l, u \in N(v)\}), m_v^l)$
  • Nonlinearity (activation): adds expressiveness to message or aggregation
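To make the two-step process concrete, here is a minimal PyTorch sketch of such a layer (neighbor message, sum aggregation, separate self-message, concatenation, nonlinearity). The dense 0/1 adjacency-matrix interface and the class/variable names are illustrative assumptions, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One GNN layer: message computation + neighbor aggregation + self-message."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.msg = nn.Linear(in_dim, out_dim, bias=False)       # W^l: neighbor message
        self.self_msg = nn.Linear(in_dim, out_dim, bias=False)  # B^l: message from v itself

    def forward(self, h, adj):
        # h: [num_nodes, in_dim], adj: [num_nodes, num_nodes] dense 0/1 adjacency
        m_u = self.msg(h)                      # message from every node u
        agg = adj @ m_u                        # sum aggregation over neighbors N(v)
        m_v = self.self_msg(h)                 # separate message from node v itself
        return torch.relu(torch.cat([agg, m_v], dim=-1))  # CONCAT + nonlinearity

# toy usage
h = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
layer = SimpleGNNLayer(16, 8)
out = layer(h, adj)   # shape [5, 16] after concatenation
```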

2. Classical GNN Layers

1). Graph Convolutional Networks (GCNs)

$h_v^l = \sigma\left(W^l \sum_{u \in N(v)} \frac{h_u^{l-1}}{|N(v)|}\right) = \sigma\left(\sum_{u \in N(v)} W^l \frac{h_u^{l-1}}{|N(v)|}\right)$
Message: each neighbor computes $m_u^l = \frac{1}{|N(v)|} W^l h_u^{l-1}$ (normalized by node degree)
Aggregation: sum over messages from neighbors, then apply activation: $h_v^l = \sigma(\text{Sum}(\{m_u^l, u \in N(v)\}))$
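A hedged sketch of this GCN update in PyTorch, using the $\frac{1}{|N(v)|}$ normalization written above (the original GCN paper uses a symmetric normalization instead); the dense adjacency interface is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared W^l

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|, avoid division by zero
        msgs = self.W(h)                                  # W^l h_u^{l-1}
        return torch.relu(adj @ msgs / deg)               # sum over neighbors / |N(v)|, then sigma
```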

2). GraphSAGE

$h_v^l = \sigma(W^l \cdot \text{CONCAT}(h_v^{l-1}, \text{AGG}^l(\{h_u^{l-1}, u \in N(v)\})))$

a). GraphSAGE neighbor aggregation
  • Mean: take a weighted average of neighbors (as in GCN)
    $\text{AGG} = \sum_{u \in N(v)} \frac{h_u^{l-1}}{|N(v)|}$
  • Pool: transform neighbor vectors and apply a symmetric vector function (mean/max)
    $\text{AGG} = \text{Mean}(\{\text{MLP}(h_u^{l-1}), u \in N(v)\})$
  • LSTM: apply an LSTM to a reshuffled ordering of the neighbors
    $\text{AGG} = \text{LSTM}([h_u^{l-1}, \forall u \in \pi(N(v))])$
b). $L_2$ normalization

Optional: apply $L_2$ normalization to $h_v^l$ at every layer
$h_v^l \leftarrow \frac{h_v^l}{\|h_v^l\|_2} \;\forall v \in V$, where $\|u\|_2 = \sqrt{\sum_i u_i^2}$ ($L_2$-norm)
Without $L_2$ normalization, the embedding vectors can have very different scales ($L_2$-norms)
In some cases, normalization of embeddings results in performance improvement
After $L_2$ normalization, all vectors will have the same $L_2$-norm
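Below is a minimal sketch of a GraphSAGE-style layer with a mean aggregator and the optional $L_2$ normalization; the dense adjacency interface and the `l2_normalize` flag are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim, l2_normalize=True):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim)  # applied to CONCAT(h_v, AGG)
        self.l2_normalize = l2_normalize

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                                # mean aggregator over N(v)
        out = torch.relu(self.W(torch.cat([h, agg], dim=-1)))
        if self.l2_normalize:
            out = F.normalize(out, p=2, dim=-1)            # h_v <- h_v / ||h_v||_2
        return out
```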

3). Graph Attention Networks (GATs)
a). Not all nodes’ neighbors are equally important
  • In GCN and GraphSAGE, $\frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$. It is defined explicitly based on the structural properties of the graph (node degree), and all neighbors $u \in N(v)$ are equally important to node $v$.
  • The attention $\alpha_{vu}$ focuses on the important parts of the input data and fades out the rest.
  • Idea: the NN should devote more computing power to the small but important part of the data; which part is important depends on the context and is learned through training.

$h_v^l = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} W^l h_u^{l-1}\right)$
Goal: specify arbitrary importance to different neighbors of each node in the graph.
Idea: compute the embedding $h_v^l$ of each node in the graph following an attention strategy.

b). Attention mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$

  • Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u$ and $v$ based on their messages
    $e_{vu} = a(W^l h_u^{l-1}, W^l h_v^{l-1})$
    which indicates the importance of $u$'s message to node $v$.
  • Normalize $e_{vu}$ into the final attention weight $\alpha_{vu}$ by the softmax function
    $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$
  • Weighted sum based on the final attention weight $\alpha_{vu}$
    $h_v^l = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} W^l h_u^{l-1}\right)$

Form of attention mechanism $a$: the approach is agnostic to the choice of $a$
Example: use a simple single-layer neural network ($a$ has trainable parameters in the Linear layer)
$e_{AB} = a(W^l h_A^{l-1}, W^l h_B^{l-1}) = \text{Linear}(\text{Concat}(W^l h_A^{l-1}, W^l h_B^{l-1}))$
Parameters of $a$ are trained together with the weight matrices (i.e., the parameters of $W^l$) in an end-to-end fashion.
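A single-head sketch of this attention computation is given below; masking non-edges with $-\infty$ before the softmax (so the normalization runs only over $k \in N(v)$) and the dense adjacency interface are implementation assumptions, and every node is assumed to have at least one neighbor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention: e_vu = Linear(Concat(W h_v, W h_u))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared W^l
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # single-layer attention mechanism a

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        z = self.W(h)                                     # W^l h^{l-1}
        n = z.size(0)
        z_v = z.unsqueeze(1).expand(n, n, -1)             # target node v along rows
        z_u = z.unsqueeze(0).expand(n, n, -1)             # source node u along columns
        e = self.a(torch.cat([z_v, z_u], dim=-1)).squeeze(-1)  # attention coefficients e_vu
        e = e.masked_fill(adj == 0, float('-inf'))        # keep only u in N(v)
        alpha = F.softmax(e, dim=1)                       # normalize over neighbors of v
        return torch.relu(alpha @ z)                      # weighted sum of messages, then sigma
```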

c). Multi-head attention

To be updated

d). Benefits of attention mechanism

Key benefit: allow for (implicitly) specifying different importance values to different neighbors

  • Computationally efficient: computation and aggregation can be parallelized
  • Storage efficient: sparse matrix operations do not require more than $O(V+E)$ entries to be stored; fixed number of parameters, irrespective of graph size
  • Localized: only attends over local network neighbors
  • Inductive capability: a shared edge-wise mechanism that does not depend on the global graph structure
e). GNN layer in practice

We can include modern deep learning modules that proved to be useful in many domains

  • Batch normalization: stabilize NN training
  • Dropout: prevent overfitting
  • Attention/Gating: control the importance of a message
  • More: any other deep learning modules
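As one illustration of how such modules are commonly slotted into a layer (the ordering linear → aggregation → BatchNorm → dropout → activation is one common choice, not the only one), a sketch:

```python
import torch
import torch.nn as nn

class GNNLayerInPractice(nn.Module):
    """Message/aggregation core wrapped with BatchNorm, dropout, and an activation."""
    def __init__(self, in_dim, out_dim, p_drop=0.5):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)   # stabilize training
        self.dropout = nn.Dropout(p_drop)   # prevent overfitting
        self.act = nn.ReLU()

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        out = adj @ self.W(h) / deg          # mean-aggregated messages
        out = self.bn(out)
        out = self.dropout(out)
        return self.act(out)
```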

3. Stacking GNN Layers

0). How to Connect GNN Layers into a GNN?
  • Stack layers sequentially (standard way)
    Input: initial raw node features $x_v$
    Output: node embeddings $h_v^{L}$ after $L$ GNN layers
  • Ways of adding skip connections
1). The Over-smoothing Problem

Issue: all the node embeddings converge to the same value after stacking many GNN layers. This is bad because we want to use node embeddings to differentiate nodes

a). Receptive field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest
In a $K$-layer GNN, each node has a receptive field of its $K$-hop neighborhood. The number of shared neighbors quickly grows when we increase the number of hops (number of GNN layers)

b). Receptive field & over-smoothing

Stack many GNN layers → Nodes will have highly-overlapped receptive fields → Node embeddings will be highly similar → Suffer from the over-smoothing problem

c). Be cautious when stacking GNN layers

Unlike neural networks in other domains, adding more GNN layers does not always help

  • Step 1: analyze the necessary receptive field to solve the problem (e.g., by computing the diameter of the graph, as in the sketch below)
  • Step 2: set the number of GNN layers $L$ to be a bit more than the necessary receptive field. Do not set $L$ to be unnecessarily large.
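For Step 1, one simple way to estimate the necessary receptive field is to compute the graph diameter, e.g., with networkx; the random graph below is only a placeholder, and the fallback to the largest connected component is an assumption for graphs that are not connected.

```python
import networkx as nx

G = nx.erdos_renyi_graph(100, 0.05, seed=42)          # placeholder graph for illustration
if nx.is_connected(G):
    diameter = nx.diameter(G)
else:
    largest_cc = max(nx.connected_components(G), key=len)
    diameter = nx.diameter(G.subgraph(largest_cc))    # diameter of the largest component

num_layers = diameter + 1   # a bit more than the needed receptive field
print(diameter, num_layers)
```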
2). Expressive Power for Shallow GNNs
a). Increase the expressive power within each GNN layer
  • In our previous examples, each transformation or aggregation function only includes one linear layer
  • We can make the aggregation and transformation functions deep neural networks themselves
b). Add layers that do not pass messages

A GNN does not necessarily only contain GNN layers. We can add MLP layers before and after GNN layers as preprocessing layers and postprocessing layers.

  • Preprocessing layers: important when encoding node features is necessary (e.g., when nodes represent images / text)
  • Postprocessing layers: important when reasoning / transformation over node embeddings is needed (graph classification, knowledge graphs)

In practice, adding these layers works great.
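A sketch of this overall architecture (pre-processing MLP → stacked GNN layers → post-processing MLP); the GCN-style mean aggregation inside the loop and the 2-layer MLP sizes are illustrative choices, not a prescribed design.

```python
import torch
import torch.nn as nn

class GNNWithMLPs(nn.Module):
    """Pre-processing MLP -> stacked GNN layers -> post-processing MLP."""
    def __init__(self, in_dim, hid_dim, out_dim, num_gnn_layers=3):
        super().__init__()
        self.pre_mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, hid_dim))
        # each "GNN layer" here is a linear transform applied after mean aggregation
        self.gnn_layers = nn.ModuleList(
            [nn.Linear(hid_dim, hid_dim) for _ in range(num_gnn_layers)])
        self.post_mlp = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                      nn.Linear(hid_dim, out_dim))

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = self.pre_mlp(x)                      # encode raw node features
        for layer in self.gnn_layers:
            h = torch.relu(layer(adj @ h / deg)) # GCN-style mean aggregation
        return self.post_mlp(h)                  # reasoning over final embeddings
```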

3). Add skip connections in GNNs

Observation from over-smoothing: node embeddings in earlier GNN layers can sometimes better differentiate nodes.
Solution: we can increase the impact of earlier layers on the final node embeddings by adding shortcuts in GNNs

  • Idea of skip connections
    Before adding shortcuts: $F(x)$
    After adding shortcuts: $F(x) + x$ (see the sketch after this list)
  • Why do skip connections work?
    Intuition: skip connections create a mixture of models.
    $N$ skip connections lead to $2^N$ possible paths, and each path could have up to $N$ modules. We automatically get a mixture of shallow GNNs and deep GNNs.
  • Other options: directly skip to the last layer
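A minimal sketch of a GNN layer wrapped with such a skip connection, so the output is $F(x) + x$ (this requires the input and output dimensions to match); the dense adjacency interface is again an assumption.

```python
import torch
import torch.nn as nn

class GNNLayerWithSkip(nn.Module):
    """GNN layer whose output is F(x) + x (residual / skip connection)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # in_dim == out_dim so shapes match

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        f_x = torch.relu(adj @ self.W(h) / deg)    # F(x): one GCN-style update
        return f_x + h                             # skip connection: F(x) + x
```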