CS224W: Machine Learning with Graphs - 06 Graph Neural Networks (GNN) 1: GNN Model

GNN Model

0. Limitations of shallow embedding methods

  • $O(|V|)$ parameters are needed: no sharing of parameters between nodes, so every node has its own unique embedding
  • Inherently “transductive”: cannot generate embeddings for nodes not seen during training
  • Do not incorporate node features: features should be leveraged

1. Deep Graph Encoders

0). Deep Methods based on GNN

$\text{ENC}(v)=$ multiple layers of non-linear transformations based on graph structure
Note: all deep encoders can be combined with node similarity functions

1). Modern ML Toolbox

The modern deep learning toolbox is designed for simple sequences and grids, but networks are far more complex:

  • Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
  • No fixed node ordering or reference point
  • Often dynamic and have multimodal features

2. Basics of Deep Learning

To be updated

3. Deep Learning for Graphs

1). A Naive Approach

Join the adjacency matrix and node features, then feed them into a deep neural network (a sketch follows the issue list below)
Issues:

  • $O(|V|)$ parameters
  • Not applicable to graphs of different sizes
  • Sensitive to node ordering
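A minimal sketch of this naive model (assuming PyTorch; the class name, sizes, and toy graph are made up for illustration) makes the issues concrete: the input width depends on $|V|$, the network only accepts graphs with exactly that many nodes, and permuting the node order changes the input.

```python
import torch
import torch.nn as nn

class NaiveGraphMLP(nn.Module):
    """Hypothetical naive model: feed [adjacency row | node features] into an MLP."""
    def __init__(self, num_nodes, feat_dim, hidden_dim):
        super().__init__()
        # Input width grows with num_nodes -> O(|V|) parameters,
        # and the model is tied to graphs with exactly num_nodes nodes.
        self.mlp = nn.Sequential(
            nn.Linear(num_nodes + feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, A, X):
        # Concatenating each adjacency row with that node's features means a
        # different node ordering produces a different input (not permutation invariant).
        return self.mlp(torch.cat([A, X], dim=1))

A = torch.randint(0, 2, (5, 5)).float()   # toy adjacency matrix
X = torch.randn(5, 3)                     # toy node features
out = NaiveGraphMLP(num_nodes=5, feat_dim=3, hidden_dim=16)(A, X)
print(out.shape)                          # torch.Size([5, 1])
```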
2). Convolutional Networks
a). From images to graphs

Goal: generalize convolutions beyond simple lattices and leverage node features/attributes
Problem:

  • There is no fixed notion of locality or sliding window on the graph
  • The graph is permutation invariant (there is no canonical node ordering)

Idea: transform information at the neighbors and combine it:

  • Transform “message” $h_i$ from neighbors: $W_i h_i$
  • Add them up: $\sum_i W_i h_i$
b). Graph convolutional networks

Idea: node’s neighborhood defines a computation graph (determine node computation graph; propagate and transform information)
Basic approach: average information from neighbors and apply a neural network
$h_v^0 = x_v$
$h_v^{l+1} = \sigma\left(W_l \sum_{u\in N(v)} \dfrac{h_u^l}{|N(v)|} + B_l h_v^l\right), \ \forall l\in \{0,\dots,L-1\}$
$z_v = h_v^L$
where

  • $h_v^l$: hidden representation of node $v$ at layer $l$
  • $W_l$: weight matrix for neighborhood aggregation
  • $B_l$: weight matrix for transforming the hidden vector of the node itself
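As a sketch only (assuming PyTorch; the shapes and toy graph are illustrative assumptions), the layer above can be written node-by-node, with $W_l$ and $B_l$ as linear maps shared across all nodes:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One layer of h_v^{l+1} = sigma(W_l * mean_{u in N(v)} h_u^l + B_l * h_v^l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # W_l: neighborhood aggregation
        self.B = nn.Linear(in_dim, out_dim, bias=False)  # B_l: self transformation

    def forward(self, neighbors, H):
        # neighbors[v] is N(v) (assumed non-empty here); H[v] is h_v^l
        out = []
        for v in range(H.shape[0]):
            neigh_mean = torch.stack([H[u] for u in neighbors[v]]).mean(dim=0)
            out.append(torch.relu(self.W(neigh_mean) + self.B(H[v])))
        return torch.stack(out)

# Toy usage: h_v^0 = x_v, and stacking two layers gives z_v = h_v^2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
X = torch.randn(3, 4)
layer1, layer2 = GCNLayer(4, 8), GCNLayer(8, 8)
Z = layer2(neighbors, layer1(neighbors, X))   # node embeddings z_v
```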
c). Matrix formulation

Many aggregations can be performed efficiently by (sparse) matrix operations
Let $H^l = [h_1^l \cdots h_{|V|}^l]^T$; then $\sum_{u\in N(v)}h_u^l = A_v H^l$
Let $D$ be the diagonal matrix where $D_{vv} = \mathrm{Deg}(v) = |N(v)|$; then $D_{vv}^{-1} = 1/|N(v)|$
Rewriting the update function in matrix form:
$H^{l+1} = \sigma(\tilde A H^l W_l^T + H^l B_l^T)$
where $\tilde A = D^{-1}A$
This implies that efficient sparse matrix multiplication can be used ($\tilde A$ is sparse)
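A minimal sketch of the matrix-form update with a sparse $\tilde A$ (assuming PyTorch; the toy graph and shapes are illustrative assumptions):

```python
import torch

# Build A as a sparse COO tensor and scale each row v by 1/|N(v)| to get Ã = D^{-1} A.
edges = torch.tensor([[0, 0, 1, 2],
                      [1, 2, 0, 0]])                        # (row, col) indices of A
A = torch.sparse_coo_tensor(edges, torch.ones(4), (3, 3))
deg = torch.sparse.sum(A, dim=1).to_dense().clamp(min=1)    # D_vv = |N(v)|
A_tilde = torch.sparse_coo_tensor(edges, 1.0 / deg[edges[0]], (3, 3))

H = torch.randn(3, 4)   # H^l
W = torch.randn(8, 4)   # W_l
B = torch.randn(8, 4)   # B_l

# H^{l+1} = sigma(Ã H^l W_l^T + H^l B_l^T); the sparse-dense product costs O(#edges).
H_next = torch.relu(torch.sparse.mm(A_tilde, H) @ W.T + H @ B.T)
print(H_next.shape)     # torch.Size([3, 8])
```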

d). How to train a GNN
  • Node embedding $z_v$ is a function of the input graph
  • Supervised setting: minimize the loss $L$
    $\min_\theta L(y, f(z_v))$
    Example: binary node classification
    $L = -\sum_{v\in V}\left[y_v\log(\sigma(z_v^T\theta)) + (1-y_v)\log(1-\sigma(z_v^T\theta))\right]$
    where $\theta$ is the classification weight vector
  • Unsupervised setting: no node labels are available, so use the graph structure as the supervision:
    similar nodes should have similar embeddings
    $L = \sum_{z_u,z_v}\text{CrossEntropy}(y_{uv}, \text{DEC}(z_u,z_v))$
    where $y_{uv}=1$ when nodes $u$ and $v$ are similar and DEC is the decoder (e.g., inner product)
    Node similarity can be defined in many ways, such as random walks (node2vec, DeepWalk, struc2vec), matrix factorization, or node proximity in the graph
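For the supervised case, a short sketch of the binary node-classification loss (assuming PyTorch; `z` stands in for the GNN output and `theta` for the classification weights, both illustrative):

```python
import torch
import torch.nn.functional as F

z = torch.randn(5, 8, requires_grad=True)    # stand-in for node embeddings z_v from the GNN
theta = torch.randn(8, requires_grad=True)   # classification weight vector
y = torch.tensor([1., 0., 1., 1., 0.])       # node labels y_v

logits = z @ theta                            # z_v^T theta
# Equivalent to -mean_v [ y_v log(sigma(z_v^T theta)) + (1-y_v) log(1 - sigma(z_v^T theta)) ]
loss = F.binary_cross_entropy_with_logits(logits, y)
loss.backward()                               # gradients reach theta and, through z_v, the GNN weights
```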
e). Model design: overview
  1. Define a neighborhood aggregation function
  2. Define a loss function on the embeddings
  3. Train on a set of nodes
  4. Generate embeddings for nodes as needed
f). Inductive Capability

The same aggregation parameters ($W_l$ and $B_l$) are shared across all nodes: the number of model parameters is sublinear in $|V|$, and we can generalize to unseen nodes (new graphs or new nodes).
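A small sketch of this inductive behavior (assuming PyTorch; graphs and shapes are toy assumptions): because $W_l$ and $B_l$ have fixed shapes, the same weights apply to graphs of any size.

```python
import torch

W = torch.randn(8, 4)   # W_l, shape independent of |V|
B = torch.randn(8, 4)   # B_l, shape independent of |V|

def gcn_layer(A, H):
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # |N(v)|
    return torch.relu((A @ H / deg) @ W.T + H @ B.T)

# Graph available at training time: 3 nodes.
A_train = torch.tensor([[0., 1., 1.],
                        [1., 0., 0.],
                        [1., 0., 0.]])
H_train = gcn_layer(A_train, torch.randn(3, 4))

# Unseen, larger graph: the same W and B are reused without retraining.
A_new = torch.ones(5, 5) - torch.eye(5)
H_new = gcn_layer(A_new, torch.randn(5, 4))
print(H_train.shape, H_new.shape)   # torch.Size([3, 8]) torch.Size([5, 8])
```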
