GNN Augmentation and Training
0. A General GNN Framework
Idea: raw input graph ≠ computational graph
- Graph feature augmentation
- Graph structure manipulation
1). Why Augment Graphs?
Our assumption so far has been: raw input graph = computational graph
Reasons for breaking this assumption
- Features
The input graph lacks features
- Graph structure
The graph is too sparse → inefficient message passing
The graph is too dense → message passing is too costly
The graph is too large → cannot fit the computational graph into a GPU
It is unlikely that the input graph happens to be the optimal computation graph for embeddings.
2). Graph Augmentation Approaches
- Graph feature augmentation
The input graph lacks features → feature augmentation
- Graph structure augmentation
The graph is too sparse → add virtual nodes / edges
The graph is too dense → sample neighbors when doing message passing
The graph is too large → sample subgraphs to compute embeddings
1. Feature Augmentation on Graphs
Why Do We Need Feature Augmentation?
- Input graph does not have node features
- Certain structures are hard to learn by GNN
1). Input Graph Does Not Have Node Features
This is common when we only have the adjacency matrix
Standard approaches:
- Assign constant values to nodes
- Assign unique IDs to nodes (one-hot vectors)
| | Constant node feature | One-hot node feature |
|---|---|---|
| Expressive power | Medium. All nodes are identical, but the GNN can still learn from the graph structure | High. Each node has a unique ID, so node-specific information can be stored |
| Inductive learning (generalize to unseen nodes) | High. Simple to generalize to new nodes | Low. Cannot generalize to new nodes: new nodes introduce new IDs, and the GNN does not know how to embed unseen IDs |
| Computational cost | Low. Only a 1-dimensional feature | High. $O(\lvert V\rvert)$-dimensional feature, cannot apply to large graphs |
| Use cases | Any graph, inductive settings (generalize to new nodes) | Small graphs, transductive settings (no new nodes) |
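As a rough illustration of the two schemes, a minimal PyTorch sketch (the variable `num_nodes` and the toy sizes are hypothetical) could look like this:

```python
import torch
import torch.nn.functional as F

num_nodes = 5  # hypothetical small graph

# Constant node feature: every node gets the same 1-dimensional feature.
x_constant = torch.ones(num_nodes, 1)                    # shape [num_nodes, 1]

# One-hot node feature: each node gets a unique |V|-dimensional ID vector.
x_one_hot = F.one_hot(torch.arange(num_nodes)).float()   # shape [num_nodes, num_nodes]

print(x_constant.shape, x_one_hot.shape)
```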
2). Certain Structures Are Hard to Learn by GNN
Example: a GNN cannot learn the length of a cycle, because every node in a cycle graph has degree 2, so all nodes share the same computational graph (the same binary tree).
Solution: we can use cycle count as augmented node features
Other solutions: node degree, clustering coefficient, PageRank, Centrality, …
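A minimal sketch, assuming an undirected NetworkX graph, of how such structural features could be computed and stacked into a node feature matrix (the toy cycle graph is only for illustration):

```python
import networkx as nx
import torch

# Hypothetical example: a 6-node cycle, where every node has degree 2.
G = nx.cycle_graph(6)

degree = dict(G.degree())        # node degree
clustering = nx.clustering(G)    # clustering coefficient
pagerank = nx.pagerank(G)        # PageRank score

# Stack the structural features into a [num_nodes, 3] feature matrix.
x = torch.tensor(
    [[degree[v], clustering[v], pagerank[v]] for v in G.nodes()],
    dtype=torch.float,
)
print(x.shape)  # torch.Size([6, 3])
```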
2. Structure Augmentation on Graphs
1). Add Virtual Nodes / Edges
Motivation: augment sparse graphs
a). Add virtual edges
- Common approach: connect 2-hop neighbors via virtual edges
- Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A+A^2$ (see the sketch after this list)
- Use cases: bipartite graphs
Author-to-papers: 2-hop virtual edges make an author-author collaboration graph
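A minimal sketch, assuming a small author-to-paper bipartite graph in NetworkX (the node names are hypothetical), of how the augmented adjacency $A+A^2$ could be formed:

```python
import networkx as nx

# Hypothetical author-to-paper bipartite graph.
G = nx.Graph()
G.add_edges_from([("author1", "paperA"), ("author2", "paperA"), ("author2", "paperB")])

A = nx.adjacency_matrix(G).toarray().astype(float)

# A^2 counts 2-hop paths, so A + A^2 adds virtual edges between 2-hop
# neighbors, e.g. author1 and author2 become connected via their shared paper.
A_aug = A + A @ A
```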
b). Add virtual nodes
The virtual node connects to all the nodes in the graph, so any two nodes are at most a distance of two apart: Node A - Virtual node - Node B
Benefits: greatly improve message passing in sparse graphs
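A minimal sketch of adding such a virtual node with NetworkX (the path graph and the node name "virtual" are hypothetical):

```python
import networkx as nx

# Hypothetical sparse graph: 5 nodes in a line.
G = nx.path_graph(5)

# Connect a virtual node to every existing node; any two original nodes
# are now at most 2 hops apart (Node A - virtual node - Node B).
original_nodes = list(G.nodes())
G.add_node("virtual")
G.add_edges_from(("virtual", v) for v in original_nodes)

print(nx.diameter(G))  # 2
```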
2). Node Neighborhood Sampling
Idea: (randomly) sample a node’s neighborhood for message passing
Example: we can randomly choose 2 neighbors to pass messages in a given layer; in the next layer when we compute the embeddings we can sample different neighbors. In expectation, we get embeddings similar to the case where all the neighbors are used.
Benefits: greatly reduces computational cost, allowing scaling to large graphs
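A minimal sketch in plain Python, assuming the graph is given as an adjacency list `adj` (a hypothetical toy example), of sampling at most 2 neighbors per node:

```python
import random

# Hypothetical adjacency list of a small graph.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}

def sample_neighbors(node, num_samples=2):
    """Randomly sample at most num_samples neighbors for message passing."""
    neighbors = adj[node]
    if len(neighbors) <= num_samples:
        return neighbors
    return random.sample(neighbors, num_samples)

# A different neighbor set can be drawn at every layer / epoch, so in
# expectation the result resembles aggregating over all neighbors.
print(sample_neighbors(0))
```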
3. Training GNNs
1). GNN Prediction Heads
Idea: different task levels require different prediction heads
a). Node-level prediction
Directly make prediction using node embeddings
After GNN computation, we have $d$-dimensional node embeddings $\{h_v^l\in R^d, \forall v\in G\}$
For a $k$-way prediction problem
- Classification: classify among $k$ categories
- Regression: regress on $k$ targets

$$\hat{y}_v=\text{Head}_{\text{node}}(h_v^l)=W^Hh_v^l$$

where $W^H\in R^{k\times d}$ maps node embeddings from $h_v^l\in R^d$ to $\hat{y}_v\in R^k$ so that we can compute the loss.
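A minimal sketch of this linear node-level head in PyTorch (the dimensions d, k and the random embeddings are hypothetical):

```python
import torch
import torch.nn as nn

d, k, num_nodes = 64, 7, 100        # hypothetical embedding dim, #classes, #nodes

# Head_node(h_v) = W^H h_v: a single linear map from R^d to R^k.
head_node = nn.Linear(d, k, bias=False)

h = torch.randn(num_nodes, d)       # node embeddings from the last GNN layer
y_hat = head_node(h)                # shape [num_nodes, k], fed into the loss
print(y_hat.shape)
```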
b). Edge-level prediction
Make prediction using pairs of node embeddings
For a $k$-way prediction problem

$$\hat{y}_{uv}=\text{Head}_{\text{edge}}(h_u^l, h_v^l)$$
Options for $\text{Head}_{\text{edge}}(h_u^l, h_v^l)$
- Concatenation + Linear
$\hat{y}_{uv}=\text{Linear}(\text{Concat}(h_u^l, h_v^l))$
where $\text{Linear}(\cdot)$ maps the $2d$-dimensional concatenated embedding to a $k$-dimensional output
- Dot product
For 1-way prediction (e.g., link prediction: predict the existence of an edge):
$\hat{y}_{uv}=(h_u^l)^Th_v^l$
Applied to $k$-way prediction, similar to multi-head attention, with trainable weight matrices $W^1, \cdots, W^k$:
$\hat{y}_{uv}^1=(h_u^l)^TW^1h_v^l$
$\cdots$
$\hat{y}_{uv}^k=(h_u^l)^TW^kh_v^l$
$\hat{y}_{uv}=\text{Concat}(\hat{y}_{uv}^1, \cdots, \hat{y}_{uv}^k)\in R^k$
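A minimal PyTorch sketch of the edge-head options above (dimensions and embeddings are hypothetical):

```python
import torch
import torch.nn as nn

d, k = 64, 4                        # hypothetical embedding dim and #ways
h_u = torch.randn(d)                # embedding of node u
h_v = torch.randn(d)                # embedding of node v

# Option 1: concatenation + linear, maps the 2d-dim input to k dims.
linear = nn.Linear(2 * d, k)
y_uv_concat = linear(torch.cat([h_u, h_v]))                      # shape [k]

# Option 2: dot product for 1-way prediction (e.g. link prediction).
y_uv_dot = h_u @ h_v                                             # scalar

# Option 2 for k-way prediction: one trainable matrix W^i per output.
W = nn.Parameter(torch.randn(k, d, d))
y_uv_multi = torch.stack([h_u @ W[i] @ h_v for i in range(k)])   # shape [k]
```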
c). Graph-level prediction
Make prediction using all node embeddings in the graph
For a $k$-way prediction problem

$$\hat{y}_G=\text{Head}_{\text{graph}}(\{h_v^l\in R^d, \forall v\in G\})$$

where $\text{Head}_{\text{graph}}(\cdot)$ is similar to $\text{AGG}(\cdot)$ in a GNN layer
Options for $\text{Head}_{\text{graph}}(\{h_v^l\in R^d, \forall v\in G\})$ in small graphs
- Global mean pooling: $\hat{y}_G=\text{Mean}(\{h_v^l\in R^d, \forall v\in G\})$
- Global max pooling: $\hat{y}_G=\text{Max}(\{h_v^l\in R^d, \forall v\in G\})$
- Global sum pooling: $\hat{y}_G=\text{Sum}(\{h_v^l\in R^d, \forall v\in G\})$
Issue: global pooling over a (large) graph will lose information
Solution: aggregate all the node embeddings hierarchically (DiffPool)
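A minimal PyTorch sketch of the three flat global pooling options (not DiffPool), over a hypothetical embedding matrix of one graph:

```python
import torch

num_nodes, d = 10, 64
h = torch.randn(num_nodes, d)    # node embeddings of one graph

y_mean = h.mean(dim=0)           # global mean pooling -> shape [d]
y_max = h.max(dim=0).values      # global max pooling  -> shape [d]
y_sum = h.sum(dim=0)             # global sum pooling  -> shape [d]
```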
2). Labels
a). Supervised labels on graphs
Supervised labels come from specific use cases
- Node labels $y_v$: in a citation network, which subject area a node belongs to
- Edge labels $y_{uv}$: in a transaction network, whether an edge is fraudulent
- Graph labels $y_G$: among molecular graphs, the drug-likeness of graphs
Advice: reduce your task to node / edge / graph labels since they are easy to work with
b). Unsupervised labels on graphs
Problem: sometimes we only have a graph without any external labels
Solution: “self-supervised learning”; we can find supervision labels within the graph
- Node labels $y_v$: node statistics, such as clustering coefficient, PageRank, …
- Edge labels $y_{uv}$: link prediction; hide the edge between two nodes and predict whether there should be a link
- Graph labels $y_G$: graph statistics, e.g., predict whether two graphs are isomorphic
3). Loss Function
- Classification loss: cross entropy (CE) is a very common loss function in classification
- Regression loss: we often use mean squared error (MSE), a.k.a. L2 loss
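For example, with PyTorch the two standard losses could be computed as follows (the predictions and labels here are hypothetical random tensors):

```python
import torch
import torch.nn.functional as F

# Classification: cross entropy over k-way logits.
logits = torch.randn(8, 5)              # 8 examples, 5 classes
labels = torch.randint(0, 5, (8,))      # ground-truth class indices
ce_loss = F.cross_entropy(logits, labels)

# Regression: mean squared error (L2 loss) over k targets.
preds = torch.randn(8, 3)
targets = torch.randn(8, 3)
mse_loss = F.mse_loss(preds, targets)
```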
4). Evaluation Metrics
a). Regression
- Root MSE (RMSE)
- Mean absolute error (MAE)
b). Classification
- Multi-class classification
Accuracy: $\frac{1}{N}\sum_{i=1}^{N} 1[\text{argmax}(\hat{y}^i)=y^i]$
- Binary classification
Accuracy: metric sensitive to the classification threshold
Precision / recall: metrics sensitive to the classification threshold
ROC AUC: metric agnostic to the classification threshold
| | Actually Positive (1) | Actually Negative (0) |
|---|---|---|
| Predicted Positive (1) | True Positives (TP) | False Positives (FP) |
| Predicted Negative (0) | False Negatives (FN) | True Negatives (TN) |
Accuracy: $\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}=\frac{\text{TP}+\text{TN}}{|\text{Dataset}|}$
Precision (P): $\frac{\text{TP}}{\text{TP}+\text{FP}}$
Recall (R): $\frac{\text{TP}}{\text{TP}+\text{FN}}$
F1-score: $\frac{2PR}{P+R}$
ROC curve: captures the tradeoff between TPR ($\frac{\text{TP}}{\text{TP}+\text{FN}}$, i.e., recall) and FPR ($\frac{\text{FP}}{\text{FP}+\text{TN}}$) as the classification threshold is varied for a binary classifier.
ROC AUC: area under the ROC curve
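A quick sketch with scikit-learn of the binary metrics above (labels and scores are hypothetical, and the 0.5 threshold is an arbitrary choice):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3]    # classifier scores
y_pred = [1 if s > 0.5 else 0 for s in y_score]        # threshold at 0.5

print(accuracy_score(y_true, y_pred))    # sensitive to the threshold
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))    # agnostic to the threshold
```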
5). Dataset Split: Fixed / Random Split
Fixed split: split dataset once
- Training set: used for optimizing GNN parameters
- Validation set: develop model / hyperparameters
- Test set: held out until reporting final performance
Random split: randomly split the dataset into training / validation / test sets
a). Why splitting graphs is special
Image classification: each data point is an image and data points are independent
Node classification: each data point is a node and data points are NOT independent
Solutions
- Transductive setting: the entire input graph can be observed in all dataset splits (training, validation, and test sets); only the (node) labels are split
- Inductive setting: break the edges between splits to get multiple independent graphs
| | Transductive setting | Inductive setting |
|---|---|---|
| Training / validation / test | On the same entire graph | On different graphs |
| Applications | Node / edge tasks | Node / edge / graph tasks |
b). Example: node classification
- Transductive node classification
All splits can observe the entire graph structure, but each split only observes the labels of its own nodes.
- Inductive node classification
Each split contains an independent graph.
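As an illustration of the transductive setting, the label split is often expressed as boolean masks over one fixed graph; a minimal PyTorch sketch with a hypothetical 60/20/20 split:

```python
import torch

num_nodes = 100
perm = torch.randperm(num_nodes)          # random node order

# Transductive split: the whole graph is visible everywhere,
# only the node labels are partitioned into train / val / test.
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[perm[:60]] = True
val_mask[perm[60:80]] = True
test_mask[perm[80:]] = True

# The loss is then computed only on the training nodes, e.g.
# loss = F.cross_entropy(out[train_mask], y[train_mask])
```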
c). Example: graph classification
Only the inductive setting is well defined for graph classification because we have to test on unseen graphs.
d). Example: link prediction
- Goal of link prediction: predict missing edges
- Link prediction is an unsupervised / self-supervised task. We need to create the labels and dataset splits on our own
- Concretely, we need to hide some edges from the GNN and let the GNN predict if the edges exist
Setting up link prediction
Split edges twice
- Step 1: assign 2 types of edges in the original graph - message edges (used for GNN message passing) and supervision edges (used for computing the objective). After Step 1, only message edges remain in the graph; supervision edges act as supervision for the edge predictions made by the model and are not fed into the GNN.
- Step 2: split edges into training / validation / test
Option 1 for Step 2: inductive link prediction split. Each inductive split contains an independent graph and each graph has two types of edges - message edges and supervision edges
Option 2 for Step 2: transductive link prediction split (the default setting). The entire graph can be observed in all dataset splits, by definition of "transductive". But since edges are both part of the graph structure and the supervision, we need to hold out validation / test edges, and within the training set we further split the training edges into training message edges and training supervision edges
- 1). At training time: use training message edges to predict training supervision edges
- 2). At validation time: use training message edges & training supervision edges to predict validation edges
- 3). At test time, use training message edges & training supervision edges & validation edges to predict test edges
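A rough sketch in plain Python of the transductive link prediction split described above (the edge list and split sizes are hypothetical):

```python
import random

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3), (0, 2), (2, 4)]
random.shuffle(edges)

# Hold out edges for validation / test, then split the remaining training
# edges into message edges (fed into the GNN) and supervision edges.
test_edges = edges[:2]
val_edges = edges[2:4]
train_edges = edges[4:]
train_supervision = train_edges[:2]
train_message = train_edges[2:]

# 1) training:   message = train_message,                                 predict train_supervision
# 2) validation: message = train_message + train_supervision,             predict val_edges
# 3) test:       message = train_message + train_supervision + val_edges, predict test_edges
```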