[Repost] Graph Convolutional Networks (GCN)

Graph Convolutional Networks (GCN): Making Use of Structural Information

Original article: https://ai.plainenglish.io/graph-convolutional-networks-gcn-baf337d5cb6b

In this post, we’re gonna take a close look at one of the well-known Graph neural networks named GCN. First, we’ll get the intuition to see how it works, then we’ll go deeper into the maths behind it.

Why Graphs?

Examples of graphs. (Picture from [1])

Tasks on Graphs

  • Node classification: Predict the type of a given node
  • Link prediction: Predict whether two nodes are linked
  • Community detection: Identify densely linked clusters of nodes
  • Network similarity: Measure how similar two (sub)networks are

Machine Learning Lifecycle

In the graph, we have node features (the data of nodes) and the structure of the graph (how nodes are connected).

For the former, we can easily get the data from each node. But when it comes to the structure, it is not trivial to extract useful information from it. For example, if two nodes are close to one another, should we treat them differently from other pairs? How about high- and low-degree nodes? In fact, each specific task can consume a lot of time and effort just on feature engineering, i.e., on distilling the structure into features.

Feature engineering on graphs. (Picture from [1])

It would be much better to somehow feed both the node features and the structure as input, and let the machine figure out what information is useful by itself.

That’s why we need Graph Representation Learning.

We want the model to learn the “feature engineering” by itself. (Picture from [1])

Graph Convolutional Networks (GCNs)

Paper: Semi-supervised Classification with Graph Convolutional Networks (2017) [3]

GCN is a type of convolutional neural network that can work directly on graphs and take advantage of their structural information.

It solves the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes (semi-supervised learning).

Example of Semi-supervised learning on Graphs. Some nodes don't have labels (unknown nodes).

Main Ideas

As the name “Convolutional” suggests, the idea came from images and was then brought to graphs. However, while images have a fixed grid structure, graphs are much more complex.

Convolution idea from images to graphs. (Picture from [1])

The general idea of GCN: for each node, we get the feature information from all its neighbors and, of course, its own features. Assume we use the average() function. We do the same for all nodes. Finally, we feed these averaged values into a neural network.

In the following figure, we have a simple example with a citation network. Each node represents a research paper, while edges are the citations. We have a preprocessing step here: instead of using the raw papers as features, we convert the papers into vectors (by using an NLP embedding, e.g., tf–idf or Doc2Vec).

Let’s consider the orange node. First off, we get all the feature values of its neighbors, including itself, then take the average. The result will be passed through a neural network to return a resulting vector.

The main idea of GCN. Consider the orange node in the middle. First, we take the average of all its neighbors, including itself. After that, the average value is passed through a neural network. Note that, in GCN, we simply use a fully connected layer. In this example, we get 2-dimensional vectors as the output (2 units in the fully connected layer).

In practice, we can use more sophisticated aggregate functions rather than the average function. We can also stack more layers on top of each other to get a deeper GCN. The output of a layer will be treated as the input for the next layer.
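To make this aggregation step concrete, here is a minimal NumPy sketch of the “average the neighbors (including the node itself), then apply a fully connected layer” idea. The 3-node graph, the features, and the weight matrix are made-up toy values, not taken from the figures.

```python
import numpy as np

# Toy graph: 3 nodes, undirected edges 0-1 and 1-2 (illustrative only).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
X = np.array([[1.0, 0.0],    # feature vector of node 0
              [0.0, 2.0],    # feature vector of node 1
              [3.0, 1.0]])   # feature vector of node 2

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))   # weights of a fully connected layer (2-d output)

H = np.zeros_like(X)
for i in range(len(X)):
    # Average the node's own features with those of its neighbors ...
    group = [i] + neighbors[i]
    avg = X[group].mean(axis=0)
    # ... then pass the average through the fully connected layer (with ReLU).
    H[i] = np.maximum(avg @ W, 0.0)

print(H)   # one 2-dimensional output vector per node
```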

Example of 2-layer GCN: The output of the first layer is the input of the second layer.
Let’s take a closer look at the maths to see how it really works.

Intuition and the Maths behind

First, we need some notations:

Given an undirected graph $G = (V, E)$ with $N$ nodes $v_i \in V$, edges $(v_i, v_j) \in E$, an adjacency matrix $A \in \mathbb{R}^{N \times N}$ (binary or weighted), a degree matrix $D_{ii} = \sum_j A_{ij}$, and a feature matrix $X \in \mathbb{R}^{N \times C}$ ($N$ is the number of nodes, $C$ is the dimension of a feature vector).

Let’s consider a graph G as below.

From the graph G, we have an adjacency matrix A and a degree matrix D. We also have the feature matrix X.

How can we get all the feature values from neighbors for each node? The solution lies in the multiplication of A and X.

Looking at the first row of the adjacency matrix, we see that node A has a connection to E. Consequently, the first row of the resulting matrix is the feature vector of E, the node that A connects to (figure below). Similarly, the second row of the resulting matrix is the sum of the feature vectors of D and E. By doing this, we get the sum of all neighbors’ vectors for every node.

Calculate the first row of the “sum vector matrix” AX
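The exact example graph lives in the figures, but a sketch with a reconstruction that is consistent with the text (assuming edges A–E, B–D, B–E, C–E, and D–E, which matches the degrees quoted later) lets us check this claim in NumPy. The feature vectors are made-up placeholders.

```python
import numpy as np

# Assumed reconstruction of the example graph (nodes A, B, C, D, E):
# edges A-E, B-D, B-E, C-E, D-E. This is an illustrative guess consistent
# with the degrees mentioned in the text, not the exact figure.
A = np.array([
    # A  B  C  D  E
    [0, 0, 0, 0, 1],   # A
    [0, 0, 0, 1, 1],   # B
    [0, 0, 0, 0, 1],   # C
    [0, 1, 0, 0, 1],   # D
    [1, 1, 1, 1, 0],   # E
], dtype=float)

# Made-up feature vectors, one row per node (C = 2 dimensions here).
X = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [2, 0],
              [0, 3]], dtype=float)

AX = A @ X
print(AX[0])   # equals X[4]: the feature vector of E, A's only neighbor
print(AX[1])   # equals X[3] + X[4]: the sum of D's and E's feature vectors
```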

There are still a few things that need improvement here.

  1. We are missing the features of the node itself. For example, the first row of the result matrix should contain the features of node A too.
  2. Instead of the sum() function, we need to take the average, or even better, the weighted average of the neighbors’ feature vectors. Why don’t we use the sum() function? The reason is that, when using sum(), high-degree nodes are likely to end up with huge aggregate vectors, while low-degree nodes tend to get small ones, which may later cause exploding or vanishing gradients (e.g., when using sigmoid). Besides, neural networks tend to be sensitive to the scale of the input data. Thus, we need to normalize these vectors to get rid of these potential issues.

We can fix problem (1) by adding an identity matrix I to A to get a new adjacency matrix Ã.

$$\tilde{A} = A + \lambda I_N$$

Picking λ = 1 (i.e., the node's own features are just as important as its neighbors'), we have à = A + I. Note that we could treat λ as a trainable parameter, but for now we simply set λ to 1, which is also what the paper does.

By adding a self-loop to each node, we have the new adjacency matrix.
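Continuing the assumed toy graph from the earlier snippet, adding the self-loops is a one-liner, and the diagonal of the new degree matrix D̃ matches the degrees quoted below (2 for A, 5 for E):

```python
import numpy as np

# Reuses the adjacency matrix A from the previous snippet.
A_tilde = A + np.eye(A.shape[0])           # add a self-loop to every node
D_tilde = np.diag(A_tilde.sum(axis=1))     # degree matrix of the new adjacency matrix

print(np.diag(D_tilde))   # [2. 3. 2. 3. 5.] -> node A has degree 2, node E has degree 5
```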

Problem (2): For matrix scaling, we usually multiply the matrix by a diagonal matrix. In this case, we want to take the average of the summed features, or mathematically, to scale the sum vector matrix ÃX according to the node degrees. Gut feeling tells us that the diagonal matrix used for scaling here should be related to the degree matrix D̃ (why D̃ and not D? Because we are now considering the degree matrix D̃ of the new adjacency matrix Ã, not of A).

The problem now becomes: how do we want to scale/normalize the sum vectors? In other words:

How do we pass the information from neighbors to a specific node?

We start with our old friend, the average. In this case, the inverse of D̃ (i.e., D̃^{-1}) comes into play. Basically, each diagonal element of D̃^{-1} is the reciprocal of the corresponding diagonal element of D̃.

For example, node A has a degree of 2, so we multiply the sum vector of node A by 1/2, while node E has a degree of 5, so we multiply the sum vector of E by 1/5, and so on.

Thus, by multiplying ÃX by D̃^{-1}, we take the average of all neighbors’ feature vectors (including the node itself).
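Still using the assumed toy matrices from the snippets above, the row-normalized aggregation is a single line; each row of D̃⁻¹ÃX is the plain average of a node's own features and its neighbors' features:

```python
import numpy as np

# Reuses A_tilde, D_tilde, and X from the snippets above.
D_inv = np.linalg.inv(D_tilde)     # diagonal, so just the reciprocals of the degrees
avg = D_inv @ A_tilde @ X

# Row 0 is the mean of the features of A and E (A's neighborhood including itself).
print(avg[0], (X[0] + X[4]) / 2)
```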

So far so good. But you may ask: what about the weighted average()? Intuitively, it should be better if we treat high- and low-degree nodes differently.

Let’s take a deeper look at the average() approach we’ve just mentioned. From the associative property of matrix multiplication, for any three matrices $A$, $B$, and $C$, $(AB)C = A(BC)$. Rather than $\tilde{D}^{-1}(\tilde{A}X)$, we consider $(\tilde{D}^{-1}\tilde{A})X$, so $\tilde{D}^{-1}$ can also be seen as a scale factor applied to $\tilde{A}$. From this perspective, each row $i$ of $\tilde{A}$ is scaled by $\tilde{D}_{ii}$, i.e., divided by it (figure below). Note that $\tilde{A}$ is a symmetric matrix, which means row $i$ contains the same values as column $i$. If we scale each row $i$ by $\tilde{D}_{ii}$, intuitively we feel that we should do the same for the corresponding column too.

Mathematically, we are scaling $\tilde{A}_{ij}$ only by $\tilde{D}_{ii}$; we are ignoring the index $j$. So what would happen if we scaled $\tilde{A}_{ij}$ by both $\tilde{D}_{ii}$ and $\tilde{D}_{jj}$?

We’re just scaling by rows but ignoring their corresponding columns (dashed boxes)

Let's try a new scaling strategy: instead of using $\tilde{D}^{-1}\tilde{A}X$, we use $\tilde{D}^{-1}\tilde{A}\tilde{D}^{-1}X$.

Add a new scaler for columns.

The new scaler gives us the “weighted” average. What we are doing here is putting more weight on low-degree nodes and reducing the impact of high-degree nodes. The idea behind this weighted average is the assumption that low-degree nodes have a bigger impact on their neighbors, whereas high-degree nodes have a lower impact because they scatter their influence across too many neighbors.

When aggregating the features at node B, we assign the biggest weight to node B itself (degree of 3) and the lowest weight to node E (degree of 5).
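With the toy matrices from the earlier snippets, we can check this double scaling numerically; in the row of node B, B's own entry gets a larger coefficient than the entry for the high-degree node E, matching the caption above:

```python
import numpy as np

# Reuses A_tilde and D_inv from the snippets above.
S = D_inv @ A_tilde @ D_inv     # scale the rows *and* the columns by the inverse degrees

# Row of node B (index 1): B's own weight vs. the weight assigned to node E (index 4).
print(S[1, 1], S[1, 4])         # 1/(3*3) = 0.111...  vs.  1/(3*5) = 0.066...
```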

One more minor note: when using the two scalers ($\tilde{D}_{ii}$ and $\tilde{D}_{jj}$), we actually normalize twice, once for the row as before and once for the column. It would make sense to rebalance by replacing $\tilde{D}_{ii}\tilde{D}_{jj}$ with $\sqrt{\tilde{D}_{ii}\tilde{D}_{jj}}$; in other words, instead of using $\tilde{D}^{-1}$, we use $\tilde{D}^{-1/2}$. So we further alter the formula to $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X$, which is exactly what is used in the paper.

Because we normalize twice, we change “-1” to “-1/2”
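The renormalized matrix used in the paper is then just the square-root version of the same computation (toy matrices reused from the snippets above):

```python
import numpy as np

# Reuses A_tilde and D_tilde from the snippets above.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_tilde)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # D^{-1/2} A~ D^{-1/2}

# Entry (B, E) is now 1/sqrt(3*5) = 0.258... instead of 1/(3*5) = 0.066...
print(A_hat[1, 4])
```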

Quick summary so far:

  • $\tilde{A}X$: sum of all neighbors’ feature vectors, including the node itself.
  • $\tilde{D}^{-1}\tilde{A}X$: average of all neighbors’ feature vectors (including the node itself). The adjacency matrix is scaled by rows.
  • $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X$: average of all neighbors’ feature vectors (including the node itself). The adjacency matrix is scaled by both rows and columns. By doing this, we get a weighted average that favors low-degree nodes.

Ok, now let’s put things together.

Let's write $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ just for a clearer view. With a 2-layer GCN, our forward model takes the form below:

$$Z = f(X, A) = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

Recall that $N$ is the number of nodes and $C$ is the dimension of the feature vectors. $W^{(0)}$ and $W^{(1)}$ are the weight matrices of the two layers, $H$ is the number of units in the hidden layer, and $F$ is the dimension of the resulting output vectors.

For example, if we have a multi-class classification problem with 10 classes, F is set to 10. After obtaining the 10-dimensional vectors at layer 2, we pass these vectors through a softmax function for the prediction.
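Here is a minimal sketch of this 2-layer forward model in PyTorch (not the authors' reference implementation; the class name `TwoLayerGCN`, the sizes, and the data below are placeholder values, and `A_hat` is assumed to be precomputed as described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, C, H, F_out):
        super().__init__()
        self.W0 = nn.Linear(C, H, bias=False)       # first-layer weights W^(0)
        self.W1 = nn.Linear(H, F_out, bias=False)   # second-layer weights W^(1)

    def forward(self, A_hat, X):
        H1 = F.relu(A_hat @ self.W0(X))             # layer 1: A_hat X W^(0), then ReLU
        Z = F.softmax(A_hat @ self.W1(H1), dim=1)   # layer 2 + row-wise softmax
        return Z

# Placeholder sizes: N = 5 nodes, C = 2 features, H = 4 hidden units, F = 10 classes.
A_hat = torch.eye(5)                 # stand-in for the real renormalized adjacency
X = torch.randn(5, 2)
model = TwoLayerGCN(C=2, H=4, F_out=10)
print(model(A_hat, X).shape)         # torch.Size([5, 10])
```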

The loss function is simply the cross-entropy error over all labeled examples, where $\mathcal{Y}_L$ is the set of node indices that have labels:

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$
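A minimal sketch of this masked cross-entropy in PyTorch, assuming Z holds softmax outputs like those of the model above; the labels and the labeled-node mask are made-up placeholders:

```python
import torch

# Placeholder softmax outputs Z: 5 nodes, 10 classes, uniform just for the demo.
Z = torch.full((5, 10), 0.1)
labels = torch.tensor([0, 3, 0, 0, 0])                          # arbitrary ground truth
labeled_mask = torch.tensor([True, True, False, False, False])  # Y_L: only nodes 0 and 1 are labeled

# L = -sum over labeled nodes of ln Z[l, y_l]  (Y is one-hot, so the inner sum picks the true class)
loss = -torch.log(Z[labeled_mask, labels[labeled_mask]]).sum()
print(loss)   # 2 * ln(10) = 4.6... for these uniform placeholder predictions
```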

The number of layers

The meaning of #layers

The number of layers determines how far node features can travel. For example, with a 1-layer GCN, each node can only get information from its direct neighbors. The information-gathering process takes place independently and at the same time for all nodes.

When stacking another layer on top of the first one, we repeat the information-gathering process, but this time the neighbors already have information about their own neighbors (from the previous step). This makes the number of layers the maximum number of hops that each node's information can travel. So, depending on how far we think a node should gather information from the network, we can configure a proper number of layers. But again, we normally don't want to go too far in the graph: with 6–7 hops, we almost cover the entire graph, which makes the aggregation less meaningful.

Example: the information-gathering process with 2 layers for target node i
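One way to see this "layers = hops" point numerically is to look at powers of the propagation matrix: entry (i, j) of Ã^k is nonzero exactly when node j is within k hops of node i (self-loops included). A sketch with the toy matrices from the snippets above:

```python
import numpy as np

# Reuses A_tilde from the earlier snippets.
one_hop = A_tilde                  # reachable within 1 hop (including the self-loop)
two_hop = A_tilde @ A_tilde        # reachable within 2 hops

print((one_hop[0] > 0).astype(int))   # node A sees only itself and E after 1 layer
print((two_hop[0] > 0).astype(int))   # after 2 layers, A also sees E's neighbors
```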

How many GCN layers should we stack?

In the paper, the authors also conducted experiments with shallow and deep GCNs. From the figure below, we see that the best results are obtained with a 2- or 3-layer model. Moreover, with a deep GCN (more than 7 layers), performance tends to degrade (dashed blue line). One solution is to use residual connections between the hidden layers (purple line).

Performance over #layers. Picture from the paper [3]

Take home notes

  • GCNs are used for semi-supervised learning on graphs.
  • GCNs use both node features and the graph structure for training.
  • The main idea of GCN is to take the weighted average of all neighbors’ node features (including the node itself): lower-degree nodes get larger weights. We then pass the resulting feature vectors through a neural network for training.
  • We can stack more layers to make GCNs deeper. Consider residual connections for deep GCNs. Normally, we go for a 2- or 3-layer GCN.
  • Maths note: when you see a diagonal matrix, think of matrix scaling.
  • A demo of GCN with the StellarGraph library is available here [5]. The library also provides many other GNN algorithms. In addition, DGL [7] is another good library for GNNs. While StellarGraph helps us set up the model easily, DGL aims to provide a more flexible structure, which comes in handy when customizing our models.

Note from the authors of the paper.

The framework is currently limited to undirected graphs (weighted or unweighted). However, it is possible to handle both directed edges and edge features by representing the original directed graph as an undirected bipartite graph with additional nodes that represent edges in the original graph.

In the paper, the authors present the process from a Spectral perspective, while in this post, we go backward from the formula of GCN to understand how it handles the input graph from a Spatial point of view.

What’s next?

With GCNs, it seems we can make use of both the node features and the structure of the graph. However, what if the edges have different types? Should we treat each relationship differently? How do we aggregate neighbors in that case? (R-GCN)

From another perspective, how can we further improve the GCN model? Is using fixed weights that favor low-degree nodes good enough? Is there any way to let the model learn the weights automatically by itself (Graph Attention Networks)? And how can we deal with large graphs that cannot fit in memory at once, or use more complex aggregators (GraphSAGE)?

In the next post on the graph topic, we will look into some more sophisticated GNN methods.

How to deal with different relationships on the edges (brother, friend,….)?

REFERENCES

[1] Excellent slides on Graph Representation Learning by Jure Leskovec (Stanford Course — cs224w): http://web.stanford.edu/class/cs224w/slides/07-noderepr.pdf

[2] Video Graph Convolutional Networks (GCNs) made simple: https://www.youtube.com/watch?v=2KRAOZIULzw

[3] Semi-supervised Classification with Graph Convolutional Networks (2017): https://arxiv.org/pdf/1609.02907.pdf

[4] GCN source code: https://github.com/tkipf/gcn

[5] Demo with StellarGraph library: https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html

[6] A Comprehensive Survey on Graph Neural Networks (2019): https://arxiv.org/pdf/1901.00596.pdf

[7] DGL Library: https://docs.dgl.ai/guide/graph.html

### Simple and Deep Graph Convolutional Networks (SD-GCN)

#### Background

Graph Convolutional Networks (GCNs) are a powerful deep-learning tool for graph-structured data. However, because of the over-smoothing problem, traditional GCN models are usually limited to fairly shallow architectures. To overcome this limitation and exploit the representational power of deep networks, several improvements have been proposed.

One important extension is the design behind **Simple and Deep Graph Convolutional Networks (SD-GCN)**, which introduces an initial residual and identity mapping, effectively alleviating over-smoothing and enabling much deeper architectures.

#### How it works

##### Initial residual

The core idea of the initial residual is to add the original node features $X$ to the input of every layer. This mechanism can be seen as a strengthened form of the classical residual connection:

$$H^{(l)} = f(H^{(l-1)}, A) + \alpha_l X,$$

where:

- $H^{(l)}$ is the hidden state of the $l$-th layer;
- $f(\cdot,\cdot)$ is the propagation function based on the adjacency matrix $A$;
- $\alpha_l$ is a learnable coefficient that controls how strongly the original features are mixed in.

In this way, even as the number of layers grows, the original features are preserved, which mitigates the information loss and homogenization caused by repeated propagation.

##### Identity mapping

Besides the initial residual, SD-GCN also adopts skip connections: the results of earlier layers are passed directly to later layers as an additional input. Concretely,

$$H^{(l)} = f(H^{(l-1)}, A) + \beta_l H^{(k)},$$

where $k < l$ and $\beta_l$ is another coefficient to be optimized. This not only helps gradients flow more smoothly but also strengthens the model's ability to capture complex patterns.

Together, these two techniques make the resulting deep GCN more expressive while maintaining good generalization.

#### Implementation details

Below is a simplified SD-GCN example based on the PyTorch framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNIILayer(nn.Module):
    """One propagation layer with an initial residual and a skip connection."""

    def __init__(self, input_dim, output_dim, alpha=0.1, beta=0.1):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)
        self.alpha = alpha  # weight of the initial residual (original features)
        self.beta = beta    # weight of the skip connection from an earlier layer

    def forward(self, x, adj_matrix, initial_x=None, prev_h=None):
        # Propagate the transformed features over the (normalized) adjacency matrix.
        support = self.linear(x)
        if adj_matrix.is_sparse:
            h = torch.spmm(adj_matrix, support)
        else:
            h = adj_matrix @ support
        h = F.relu(h)
        if initial_x is not None:   # initial residual
            h = h + self.alpha * initial_x
        if prev_h is not None:      # skip connection from an earlier hidden state
            h = h + self.beta * prev_h
        return h


class SDCNNModel(nn.Module):
    """A stack of GCNIILayer blocks operating at a fixed hidden width."""

    def __init__(self, num_layers, input_dim, hidden_dim, output_dim, dropout_rate=0.5):
        super().__init__()
        # Project the input once so that every propagation layer (and the initial
        # residual) lives in the same hidden_dim space.
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [GCNIILayer(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        self.output_proj = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, features, adjacency_matrix):
        init_features = self.input_proj(features)  # H^(0), reused as the initial residual
        out = init_features
        hidden_states = [out]
        for idx, gcn_layer in enumerate(self.layers):
            # Skip connection from an earlier layer (k < l); here k = l // 2.
            prev_h = hidden_states[idx // 2] if idx >= 2 else None
            out = gcn_layer(out, adjacency_matrix, init_features, prev_h)
            out = self.dropout(out)
            hidden_states.append(out)
        return self.output_proj(out)
```

This snippet defines a basic version of the SD-GCN module, consisting of several `GCNIILayer` instances plus the dropout needed to improve robustness and prevent overfitting.

#### Applications and advantages

Compared with other types of GNN models, such as U-Net-style architectures or methods that rely purely on linear transformations, the SD-GCN strategy has several notable advantages:

1. More flexible and controllable: it lets users adjust the relative weights between different layers.
2. Better results: experiments show it can exceed the performance of existing state-of-the-art methods on several classification tasks.
3. Broad applicability: it adapts well and remains stable on both semi-supervised and fully labeled datasets.
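As a quick sanity check on shapes, here is a minimal usage sketch for the model above. The adjacency matrix, feature matrix, and all sizes are made-up placeholders, not values from any dataset:

```python
import torch

# Hypothetical toy sizes: 5 nodes, 4 input features, 3 output classes.
num_nodes, in_dim, hidden_dim, num_classes = 5, 4, 16, 3

A = torch.eye(num_nodes)            # placeholder adjacency (self-loops only)
X = torch.randn(num_nodes, in_dim)  # random node features

model = SDCNNModel(num_layers=4, input_dim=in_dim,
                   hidden_dim=hidden_dim, output_dim=num_classes)
logits = model(X, A)
print(logits.shape)                 # expected: torch.Size([5, 3])
```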