CS224W: Machine Learning with Graphs - 07 Graph Neural Networks (GNN) 2: Design Space

Design Space

0. A General GNN Framework

1). Message
2). Aggregation

GNN layer = Message + Aggregation
Different instantiations under this perspective (GCN, GraphSAGE, GAT, …)

3). Layer Connectivity

connect GNN layers into a GNN

  • Stack layers sequentially
  • Ways of adding skip connections
4). Graph Manipulation

Idea: raw input graph ≠ computational graph

  • Graph feature augmentation
  • Graph structure manipulation
5). Learning Objective

How do we train a GNN?

  • Supervised/unsupervised objectives
  • Node/edge/graph level objectives

1. A Single GNN Layer

Idea: compress a set of vectors into a single vector
Two-step process: Message and Aggregation

1). Message Computation

Message function: $m_u^l = \text{MSG}^l(h_u^{l-1})$
Intuition: each node will create a message, which will be sent to other nodes later
Example: a linear layer $m_u^l = W^l h_u^{l-1}$

2). Message Aggregation

Intuition: each node $v$ will aggregate the messages from its neighbors
$h_v^l = \text{AGG}^l(\{m_u^l, u \in N(v)\})$
Example: sum, mean, or max aggregator
Issue: information from node $v$ itself could get lost (the computation of $h_v^l$ does not directly depend on $h_v^{l-1}$)
Solution: include $h_v^{l-1}$ when computing $h_v^l$

  • Message: compute a message from node $v$ itself
    Perform a different message computation $m_v^l = B^l h_v^{l-1}$
  • Aggregation: after aggregating from neighbors, we can aggregate the message from node $v$ itself via concatenation or summation
    $h_v^l = \text{CONCAT}(\text{AGG}^l(\{m_u^l, u \in N(v)\}), m_v^l)$
  • Nonlinearity (activation): adds expressiveness to message or aggregation
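To make the two-step process concrete, here is a minimal PyTorch sketch of such a layer (neighbor message, sum aggregation, separate self-message, concatenation, nonlinearity). The dense 0/1 adjacency-matrix interface and the class/variable names are illustrative assumptions, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One GNN layer: message computation + neighbor aggregation + self-message."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.msg = nn.Linear(in_dim, out_dim, bias=False)       # W^l: neighbor message
        self.self_msg = nn.Linear(in_dim, out_dim, bias=False)  # B^l: message from v itself

    def forward(self, h, adj):
        # h: [num_nodes, in_dim], adj: [num_nodes, num_nodes] dense 0/1 adjacency
        m_u = self.msg(h)                      # message from every node u
        agg = adj @ m_u                        # sum aggregation over neighbors N(v)
        m_v = self.self_msg(h)                 # separate message from node v itself
        return torch.relu(torch.cat([agg, m_v], dim=-1))  # CONCAT + nonlinearity

# toy usage
h = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
layer = SimpleGNNLayer(16, 8)
out = layer(h, adj)   # shape [5, 16] after concatenation
```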

2. Classical GNN Layers

1). Graph Convolutional Networks (GCNs)

$h_v^l = \sigma\left(W^l \sum_{u \in N(v)} \frac{h_u^{l-1}}{|N(v)|}\right) = \sigma\left(\sum_{u \in N(v)} W^l \frac{h_u^{l-1}}{|N(v)|}\right)$
Message: each neighbor computes $m_u^l = \frac{1}{|N(v)|} W^l h_u^{l-1}$ (normalized by node degree)
Aggregation: sum over messages from neighbors, then apply activation: $h_v^l = \sigma(\text{Sum}(\{m_u^l, u \in N(v)\}))$
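A hedged sketch of this GCN update in PyTorch, using the $\frac{1}{|N(v)|}$ normalization written above (the original GCN paper uses a symmetric normalization instead); the dense adjacency interface is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared W^l

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|, avoid division by zero
        msgs = self.W(h)                                  # W^l h_u^{l-1}
        return torch.relu(adj @ msgs / deg)               # sum over neighbors / |N(v)|, then sigma
```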

2). GraphSAGE

$h_v^l = \sigma(W^l \cdot \text{CONCAT}(h_v^{l-1}, \text{AGG}^l(\{h_u^{l-1}, u \in N(v)\})))$

a). GraphSAGE neighbor aggregation
  • Mean: take a weighted average of neighbors (as in GCN)
    $\text{AGG} = \sum_{u \in N(v)} \frac{h_u^{l-1}}{|N(v)|}$
  • Pool: transform neighbor vectors and apply a symmetric vector function (mean/max)
    $\text{AGG} = \text{Mean}(\{\text{MLP}(h_u^{l-1}), u \in N(v)\})$
  • LSTM: apply an LSTM to a reshuffled ordering of the neighbors
    $\text{AGG} = \text{LSTM}([h_u^{l-1}, \forall u \in \pi(N(v))])$
b). $L_2$ normalization

Optional: apply $L_2$ normalization to $h_v^l$ at every layer
$h_v^l \leftarrow \frac{h_v^l}{\|h_v^l\|_2} \;\forall v \in V$, where $\|u\|_2 = \sqrt{\sum_i u_i^2}$ ($L_2$-norm)
Without $L_2$ normalization, the embedding vectors can have very different scales ($L_2$-norms)
In some cases, normalization of embeddings results in performance improvement
After $L_2$ normalization, all vectors will have the same $L_2$-norm
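Below is a minimal sketch of a GraphSAGE-style layer with a mean aggregator and the optional $L_2$ normalization; the dense adjacency interface and the `l2_normalize` flag are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim, l2_normalize=True):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim)  # applied to CONCAT(h_v, AGG)
        self.l2_normalize = l2_normalize

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                                # mean aggregator over N(v)
        out = torch.relu(self.W(torch.cat([h, agg], dim=-1)))
        if self.l2_normalize:
            out = F.normalize(out, p=2, dim=-1)            # h_v <- h_v / ||h_v||_2
        return out
```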

3). Graph Attention Networks (GATs)
a). Not all nodes’ neighbors are equally important
  • In GCN and GraphSAGE, $\frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$. It is defined explicitly based on the structural properties of the graph (node degree), and all neighbors $u \in N(v)$ are equally important to node $v$.
  • The attention $\alpha_{vu}$ focuses on the important parts of the input data and fades out the rest.
  • Idea: the NN should devote more computing power to the small but important part of the data; which part is important depends on the context and is learned through training.

$h_v^l = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} W^l h_u^{l-1}\right)$
Goal: specify arbitrary importance to different neighbors of each node in the graph.
Idea: compute the embedding $h_v^l$ of each node in the graph following an attention strategy.

b). Attention mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$

  • Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u$ and $v$ based on their messages
    $e_{vu} = a(W^l h_u^{l-1}, W^l h_v^{l-1})$
    which indicates the importance of $u$'s message to node $v$.
  • Normalize $e_{vu}$ into the final attention weight $\alpha_{vu}$ by the softmax function
    $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$
  • Weighted sum based on the final attention weight $\alpha_{vu}$
    $h_v^l = \sigma\left(\sum_{u \in N(v)} \alpha_{vu} W^l h_u^{l-1}\right)$

Form of attention mechanism $a$: the approach is agnostic to the choice of $a$
Example: use a simple single-layer neural network ($a$ has trainable parameters in the Linear layer)
$e_{AB} = a(W^l h_A^{l-1}, W^l h_B^{l-1}) = \text{Linear}(\text{Concat}(W^l h_A^{l-1}, W^l h_B^{l-1}))$
Parameters of $a$ are trained together with the weight matrices (i.e., the parameters of $W^l$) in an end-to-end fashion.
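A single-head sketch of this attention computation is given below; masking non-edges with $-\infty$ before the softmax (so the normalization runs only over $k \in N(v)$) and the dense adjacency interface are implementation assumptions, and every node is assumed to have at least one neighbor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention: e_vu = Linear(Concat(W h_v, W h_u))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared W^l
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # single-layer attention mechanism a

    def forward(self, h, adj):
        # h: [N, in_dim], adj: [N, N] dense 0/1 adjacency
        z = self.W(h)                                     # W^l h^{l-1}
        n = z.size(0)
        z_v = z.unsqueeze(1).expand(n, n, -1)             # target node v along rows
        z_u = z.unsqueeze(0).expand(n, n, -1)             # source node u along columns
        e = self.a(torch.cat([z_v, z_u], dim=-1)).squeeze(-1)  # attention coefficients e_vu
        e = e.masked_fill(adj == 0, float('-inf'))        # keep only u in N(v)
        alpha = F.softmax(e, dim=1)                       # normalize over neighbors of v
        return torch.relu(alpha @ z)                      # weighted sum of messages, then sigma
```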

c). Multi-head attention

To be updated

d). Benefits of attention mechanism

Key benefit: allow for (implicitly) specifying different importance values to different neighbors

  • Computationally efficient: computation and aggregation can be parallelized
  • Storage efficient: sparse matrix operations do not require more than $O(V+E)$ entries to be stored; fixed number of parameters, irrespective of graph size
  • Localized: only attends over local network neighbors
  • Inductive capability: a shared edge-wise mechanism that does not depend on the global graph structure
e). GNN layer in practice

We can include modern deep learning modules that proved to be useful in many domains

  • Batch normalization: stabilize NN training
  • Dropout: prevent overfitting
  • Attention/Gating: control the importance of a message
  • More: any other deep learning modules
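As one illustration of how such modules are commonly slotted into a layer (the ordering linear → aggregation → BatchNorm → dropout → activation is one common choice, not the only one), a sketch:

```python
import torch
import torch.nn as nn

class GNNLayerInPractice(nn.Module):
    """Message/aggregation core wrapped with BatchNorm, dropout, and an activation."""
    def __init__(self, in_dim, out_dim, p_drop=0.5):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)   # stabilize training
        self.dropout = nn.Dropout(p_drop)   # prevent overfitting
        self.act = nn.ReLU()

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        out = adj @ self.W(h) / deg          # mean-aggregated messages
        out = self.bn(out)
        out = self.dropout(out)
        return self.act(out)
```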

3. Stacking GNN Layers

0). How to Connect GNN Layers into a GNN?
  • Stack layers sequentially (standard way)
    Input: initial raw node features $x_v$
    Output: node embeddings $h_v^{L}$ after $L$ GNN layers
  • Ways of adding skip connections
1). The Over-smoothing Problem

Issue: all the node embeddings converge to the same value after stacking many GNN layers. This is bad because we want to use node embeddings to differentiate nodes

a). Receptive field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest
In a $K$-layer GNN, each node has a receptive field of its $K$-hop neighborhood. The number of shared neighbors quickly grows when we increase the number of hops (number of GNN layers)

b). Receptive field & over-smoothing

Stack many GNN layers → Nodes will have highly-overlapped receptive fields → Node embeddings will be highly similar → Suffer from the over-smoothing problem

c). Be cautious when stacking GNN layers

Unlike neural networks in other domains, adding more GNN layers does not always help

  • Step 1: analyze the necessary receptive field to solve the problem (e.g., by computing the diameter of the graph, as in the sketch below)
  • Step 2: set the number of GNN layers $L$ to be a bit more than the necessary receptive field. Do not set $L$ to be unnecessarily large.
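For Step 1, one simple way to estimate the necessary receptive field is to compute the graph diameter, e.g., with networkx; the random graph below is only a placeholder, and the fallback to the largest connected component is an assumption for graphs that are not connected.

```python
import networkx as nx

G = nx.erdos_renyi_graph(100, 0.05, seed=42)          # placeholder graph for illustration
if nx.is_connected(G):
    diameter = nx.diameter(G)
else:
    largest_cc = max(nx.connected_components(G), key=len)
    diameter = nx.diameter(G.subgraph(largest_cc))    # diameter of the largest component

num_layers = diameter + 1   # a bit more than the needed receptive field
print(diameter, num_layers)
```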
2). Expressive Power for Shallow GNNs
a). Increase the expressive power within each GNN layer
  • In our previous examples, each transformation or aggregation function only includes one linear layer
  • We can make the aggregation and transformation functions deep neural networks themselves
b). Add layers that do not pass messages

A GNN does not necessarily only contain GNN layers. We can add MLP layers before and after GNN layers as preprocessing layers and postprocessing layers.

  • Preprocessing layers: important when encoding node features is necessary (e.g., when nodes represent images / text)
  • Postprocessing layers: important when reasoning / transformation over node embeddings is needed (graph classification, knowledge graphs)

In practice, adding these layers works great.
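A sketch of this overall architecture (pre-processing MLP → stacked GNN layers → post-processing MLP); the GCN-style mean aggregation inside the loop and the 2-layer MLP sizes are illustrative choices, not a prescribed design.

```python
import torch
import torch.nn as nn

class GNNWithMLPs(nn.Module):
    """Pre-processing MLP -> stacked GNN layers -> post-processing MLP."""
    def __init__(self, in_dim, hid_dim, out_dim, num_gnn_layers=3):
        super().__init__()
        self.pre_mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, hid_dim))
        # each "GNN layer" here is a linear transform applied after mean aggregation
        self.gnn_layers = nn.ModuleList(
            [nn.Linear(hid_dim, hid_dim) for _ in range(num_gnn_layers)])
        self.post_mlp = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                      nn.Linear(hid_dim, out_dim))

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = self.pre_mlp(x)                      # encode raw node features
        for layer in self.gnn_layers:
            h = torch.relu(layer(adj @ h / deg)) # GCN-style mean aggregation
        return self.post_mlp(h)                  # reasoning over final embeddings
```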

3). Add skip connections in GNNs

Observation from over-smoothing: node embeddings in earlier GNN layers can sometimes better differentiate nodes.
Solution: we can increase the impact of earlier layers on the final node embeddings by adding shortcuts in GNNs

  • Idea of skip connections
    Before adding shortcuts: $F(x)$
    After adding shortcuts: $F(x) + x$ (see the sketch after this list)
  • Why do skip connections work?
    Intuition: skip connections create a mixture of models.
    $N$ skip connections lead to $2^N$ possible paths, and each path could have up to $N$ modules. We automatically get a mixture of shallow GNNs and deep GNNs.
  • Other options: directly skip to the last layer
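A minimal sketch of a GNN layer wrapped with such a skip connection, so the output is $F(x) + x$ (this requires the input and output dimensions to match); the dense adjacency interface is again an assumption.

```python
import torch
import torch.nn as nn

class GNNLayerWithSkip(nn.Module):
    """GNN layer whose output is F(x) + x (residual / skip connection)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # in_dim == out_dim so shapes match

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        f_x = torch.relu(adj @ self.W(h) / deg)    # F(x): one GCN-style update
        return f_x + h                             # skip connection: F(x) + x
```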