翻译--SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

基于图卷积网络的半监督分类

原文链接: Semi-Supervised Classification with Graph Convolutional Networks

0. 摘要

We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

我们提出了一种可扩展的、针对图结构数据的半监督学习方法,该方法基于一种能直接作用于图的高效卷积神经网络变体。我们通过谱图卷积的局部一阶近似来论证所选卷积结构的合理性。我们的模型的计算量随图中边的数量线性增长,学习到的隐藏层表示既编码局部图结构,也编码节点特征。在引文网络和一个知识图谱数据集上的大量实验表明,我们的方法显著优于相关方法。

1. 介绍

We consider the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes. This problem can be framed as graph-based semi-supervised learning, where label information is smoothed over the graph via some form of explicit graph-based regularization (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2006; Weston et al., 2012), e.g. by using a graph Laplacian regularization term in the loss function:

我们考虑对图(例如引文网络)中的节点(例如文档)进行分类的问题,其中只有一小部分节点有标签。这个问题可以被表述为基于图的半监督学习,其中标签信息通过某种形式的显式的基于图的正则化在图上被平滑 (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2006; Weston et al., 2012),例如在损失函数中使用图拉普拉斯正则项:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{\text{reg}}\,, \qquad \mathcal{L}_{\text{reg}} = \sum_{i,j} A_{ij}\, \| f(X_i) - f(X_j) \|^2 = f(X)^{\top} \Delta f(X) \tag{1}$$

Here, $\mathcal{L}_0$ denotes the supervised loss w.r.t. the labeled part of the graph, $f(\cdot)$ can be a neural network-like differentiable function, $\lambda$ is a weighing factor and $X$ is a matrix of node feature vectors $X_i$. $\Delta = D - A$ denotes the unnormalized graph Laplacian of an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes $v_i \in \mathcal{V}$, edges $(v_i, v_j) \in \mathcal{E}$, an adjacency matrix $A \in \mathbb{R}^{N \times N}$ (binary or weighted) and a degree matrix $D_{ii} = \sum_j A_{ij}$. The formulation of Eq. 1 relies on the assumption that connected nodes in the graph are likely to share the same label. This assumption, however, might restrict modeling capacity, as graph edges need not necessarily encode node similarity, but could contain additional information.

这里,$\mathcal{L}_0$ 表示图中有标签部分的监督损失,$f(\cdot)$ 可以是一个类似神经网络的可微函数,$\lambda$ 是一个加权因子,$X$ 是由节点特征向量 $X_i$ 组成的矩阵。$\Delta = D - A$ 表示无向图 $\mathcal{G} = (\mathcal{V}, \mathcal{E})$(含 $N$ 个节点 $v_i \in \mathcal{V}$、边 $(v_i, v_j) \in \mathcal{E}$)的未归一化图拉普拉斯算子,$A \in \mathbb{R}^{N \times N}$ 是邻接矩阵(二值或加权),$D_{ii} = \sum_j A_{ij}$ 是度矩阵。式 (1) 依赖于图中相连节点很可能共享相同标签的假设。然而,这种假设可能会限制建模能力,因为图的边不一定编码节点相似性,还可能包含其他信息。
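为帮助理解式 (1),下面用 numpy 给出一个简短的示意,验证图拉普拉斯正则项的逐边求和形式(对每条无向边只计一次)与二次型 $f(X)^{\top} \Delta f(X)$ 数值相等;其中的矩阵 `A`、向量 `f` 等都是说明用的假设,并非论文的原始代码。

```python
import numpy as np

# 一个 4 节点无向图的二值对称邻接矩阵(仅作演示)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 0.],
              [0., 1., 0., 0.]])
D = np.diag(A.sum(axis=1))        # 度矩阵 D_ii = Σ_j A_ij
Delta = D - A                     # 未归一化的图拉普拉斯算子 Δ = D − A

f = np.random.randn(4)            # 假设 f(X_i) 是每个节点的一维输出

# 逐边求和(每条无向边只计一次):Σ_{i<j} A_ij (f_i − f_j)^2
edge_sum = sum(A[i, j] * (f[i] - f[j]) ** 2
               for i in range(4) for j in range(i + 1, 4))
quad_form = f @ Delta @ f         # 二次型 f^T Δ f

print(np.isclose(edge_sum, quad_form))   # True:两种写法等价
```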

In this work, we encode the graph structure directly using a neural network model $f(X, A)$ and train on a supervised target $\mathcal{L}_0$ for all nodes with labels, thereby avoiding explicit graph-based regularization in the loss function. Conditioning $f(\cdot)$ on the adjacency matrix of the graph will allow the model to distribute gradient information from the supervised loss $\mathcal{L}_0$ and will enable it to learn representations of nodes both with and without labels.

在本文中,我们直接使用神经网络模型 $f(X, A)$ 对图结构进行编码,并针对所有带标签的节点在监督目标 $\mathcal{L}_0$ 上进行训练,从而避免在损失函数中使用显式的基于图的正则化。让 $f(\cdot)$ 以图的邻接矩阵为条件,可以使模型把来自监督损失 $\mathcal{L}_0$ 的梯度信息分配到整个图上,并使其能够学习带标签和不带标签节点的表示。

Our contributions are two-fold. Firstly, we introduce a simple and well-behaved layer-wise propagation rule for neural network models which operate directly on graphs and show how it can be motivated from a first-order approximation of spectral graph convolutions (Hammond et al., 2011). Secondly, we demonstrate how this form of a graph-based neural network model can be used for fast and scalable semi-supervised classification of nodes in a graph. Experiments on a number of datasets demonstrate that our model compares favorably both in classification accuracy and efficiency (measured in wall-clock time) against state-of-the-art methods for semi-supervised learning.

我们的贡献有两方面。首先,我们为直接作用于图的神经网络模型引入了一个简单且性质良好的逐层传播规则,并说明它可以从谱图卷积的一阶近似中导出 (Hammond et al., 2011)。其次,我们展示了如何将这种基于图的神经网络模型用于图中节点的快速且可扩展的半监督分类。在多个数据集上的实验表明,我们的模型在分类准确率和效率(以墙钟时间衡量)上与当前最先进的半监督学习方法相比都具有优势。

2. 图的快速近似卷积

In this section, we provide theoretical motivation for a specific graph-based neural network model $f(X, A)$ that we will use in the rest of this paper. We consider a multi-layer Graph Convolutional Network (GCN) with the following layer-wise propagation rule:

在本节中,我们为一个特定的基于图的神经网络模型 $f(X, A)$ 提供理论依据,该模型将在本文的其余部分使用。我们考虑具有以下逐层传播规则的多层图卷积网络(GCN):

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right) \tag{2}$$

Here, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph $\mathcal{G}$ with added self-connections. $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ and $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l^{th}$ layer; $H^{(0)} = X$. In the following, we show that the form of this propagation rule can be motivated via a first-order approximation of localized spectral filters on graphs (Hammond et al., 2011; Defferrard et al., 2016).

这里,$\tilde{A} = A + I_N$ 是添加了自连接的无向图的邻接矩阵,$I_N$ 是单位矩阵,$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$,$W^{(l)}$ 是每层各自的可训练权重矩阵。$\sigma(\cdot)$ 表示激活函数,例如 $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$。$H^{(l)} \in \mathbb{R}^{N \times D}$ 是第 $l$ 层的激活矩阵,$H^{(0)} = X$。接下来,我们将说明这种传播规则的形式可以通过图上局部谱滤波器的一阶近似导出 (Hammond et al., 2011; Defferrard et al., 2016)。
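为了直观理解式 (2),下面用 numpy 给出单层传播规则的一个最小示意实现;其中的函数名 `gcn_layer` 及各变量名都是说明用的假设,并非论文的官方实现。

```python
import numpy as np

def gcn_layer(A, H, W):
    """式 (2) 的单层传播:σ(D̃^{-1/2} Ã D̃^{-1/2} H W),这里 σ 取 ReLU。"""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # Ã = A + I_N,加入自连接
    d_tilde = A_tilde.sum(axis=1)              # D̃_ii = Σ_j Ã_ij
    D_inv_sqrt = np.diag(d_tilde ** -0.5)      # D̃^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # 对称归一化后的邻接矩阵
    return np.maximum(0, A_hat @ H @ W)        # ReLU 激活

# 用法示例(随机小图),H^{(0)} = X
A = (np.random.rand(5, 5) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                 # 构造对称二值邻接矩阵
X = np.random.randn(5, 3)                      # 每个节点 3 维特征
W0 = np.random.randn(3, 4) * 0.1
H1 = gcn_layer(A, X, W0)                       # 得到 5×4 的隐藏层表示
```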

2.1 谱图卷积(SPECTRAL GRAPH CONVOLUTIONS)

We consider spectral convolutions on graphs defined as the multiplication of a signal $x \in \mathbb{R}^N$ (a scalar for every node) with a filter $g_\theta = \mathrm{diag}(\theta)$ parameterized by $\theta \in \mathbb{R}^N$ in the Fourier domain, i.e.:

我们考虑图上的谱卷积,其定义为信号 $x \in \mathbb{R}^N$(每个节点一个标量)与由 $\theta \in \mathbb{R}^N$ 参数化的滤波器 $g_\theta = \mathrm{diag}(\theta)$ 在傅里叶域中的乘积,即:

$$g_\theta \star x = U g_\theta U^{\top} x \tag{3}$$

where $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top}$, with a diagonal matrix of its eigenvalues $\Lambda$ and $U^{\top} x$ being the graph Fourier transform of $x$. We can understand $g_\theta$ as a function of the eigenvalues of $L$, i.e. $g_\theta(\Lambda)$. Evaluating Eq. 3 is computationally expensive, as multiplication with the eigenvector matrix $U$ is $\mathcal{O}(N^2)$. Furthermore, computing the eigendecomposition of $L$ in the first place might be prohibitively expensive for large graphs. To circumvent this problem, it was suggested in Hammond et al. (2011) that $g_\theta(\Lambda)$ can be well-approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to $K^{th}$ order:

其中 $U$ 是归一化图拉普拉斯算子 $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top}$ 的特征向量矩阵,$\Lambda$ 是由其特征值构成的对角矩阵,$U^{\top} x$ 是 $x$ 的图傅里叶变换。我们可以把 $g_\theta$ 理解为 $L$ 的特征值的函数,即 $g_\theta(\Lambda)$。式 (3) 的计算量很大,因为与特征向量矩阵 $U$ 相乘的复杂度是 $\mathcal{O}(N^2)$。此外,对大规模图而言,先对 $L$ 做特征分解本身的代价就可能高到无法接受。为了解决这个问题,Hammond et al. (2011) 指出,$g_\theta(\Lambda)$ 可以用切比雪夫多项式 $T_k(x)$ 的 $K$ 阶截断展开很好地近似:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{\Lambda}) \tag{4}$$

with a rescaled $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$. $\lambda_{max}$ denotes the largest eigenvalue of $L$. $\theta' \in \mathbb{R}^K$ is now a vector of Chebyshev coefficients. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. The reader is referred to Hammond et al. (2011) for an in-depth discussion of this approximation.

其中 $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$ 是重新缩放后的特征值对角矩阵,$\lambda_{max}$ 表示 $L$ 的最大特征值,$\theta' \in \mathbb{R}^K$ 是切比雪夫系数向量。切比雪夫多项式由递推式 $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ 定义,其中 $T_0(x) = 1$,$T_1(x) = x$。关于这一近似的深入讨论,读者可以参考 Hammond et al. (2011)。

Going back to our definition of a convolution of a signal $x$ with a filter $g_{\theta'}$, we now have:

回到我们对信号 $x$ 与滤波器 $g_{\theta'}$ 的卷积的定义,我们现在有:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k\, T_k(\tilde{L})\, x \tag{5}$$

with $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$; as can easily be verified by noticing that $(U \Lambda U^{\top})^k = U \Lambda^k U^{\top}$. Note that this expression is now $K$-localized since it is a $K^{th}$-order polynomial in the Laplacian, i.e. it depends only on nodes that are at maximum $K$ steps away from the central node ($K^{th}$-order neighborhood). The complexity of evaluating Eq. 5 is $\mathcal{O}(|\mathcal{E}|)$, i.e. linear in the number of edges. Defferrard et al. (2016) use this $K$-localized convolution to define a convolutional neural network on graphs.

其中 $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$;注意到 $(U \Lambda U^{\top})^k = U \Lambda^k U^{\top}$ 即可很容易地验证这一点。此表达式现在是 $K$-局部化的,因为它是拉普拉斯算子的 $K$ 阶多项式,即它只依赖于距中心节点最多 $K$ 步的节点($K$ 阶邻域)。式 (5) 的计算复杂度是 $\mathcal{O}(|\mathcal{E}|)$,即与边数成线性关系。Defferrard et al. (2016) 使用这种 $K$-局部化卷积定义了图上的卷积神经网络。
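下面是式 (4)、(5) 的一个示意性 numpy 实现:利用切比雪夫递推直接对信号滤波,而不做特征分解;函数名 `cheb_conv` 及参数名均为说明用的假设。

```python
import numpy as np

def cheb_conv(L, x, theta, lam_max=2.0):
    """式 (5):g_{θ'} ⋆ x ≈ Σ_k θ'_k T_k(L̃) x,用递推避免显式特征分解。
    为简洁起见默认 λ_max = 2(归一化拉普拉斯特征值的上界),也可以传入真实的最大特征值。"""
    N = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(N)      # L̃ = 2L/λ_max − I_N
    Tx_prev, Tx_curr = x, L_tilde @ x              # T_0(L̃)x = x,T_1(L̃)x = L̃x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx_curr
    for k in range(2, len(theta)):                 # T_k = 2 L̃ T_{k-1} − T_{k-2}
        Tx_next = 2 * (L_tilde @ Tx_curr) - Tx_prev
        out = out + theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return out

# 用法示例:先由邻接矩阵构造归一化拉普拉斯 L = I_N − D^{-1/2} A D^{-1/2}
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
x = np.random.randn(3)
y = cheb_conv(L, x, theta=[0.5, -0.3, 0.1])        # K = 2 的三项截断
```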

2.2 逐层线性模型(LAYER-WISE LINEAR MODEL)

A neural network model based on graph convolutions can therefore be built by stacking multiple
convolutional layers of the form of Eq. 5, each layer followed by a point-wise non-linearity. Now,
imagine we limited the layer-wise convolution operation to $K = 1$ (see Eq. 5), i.e. a function that is linear w.r.t. $L$ and therefore a linear function on the graph Laplacian spectrum.

因此,可以通过堆叠多个式 (5) 形式的卷积层来构建基于图卷积的神经网络模型,每一层后面接一个逐点的非线性。现在,设想我们把逐层卷积操作限制为 $K = 1$(见式 5),即一个关于 $L$ 线性的函数,因而是图拉普拉斯谱上的线性函数。

In this way, we can still recover a rich class of convolutional filter functions by stacking multiple such layers, but we are not limited to the explicit parameterization given by, e.g., the Chebyshev polynomials. We intuitively expect that such a model can alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions, such as social networks, citation networks, knowledge graphs and many other real-world graph datasets. Additionally, for a fixed computational budget, this layer-wise linear formulation allows us to build deeper models, a practice that is known to improve modeling capacity on a number of domains (He et al.,2016).

通过这种方式,我们仍然可以通过堆叠多个这样的层来获得丰富的一类卷积滤波函数,但不再局限于例如切比雪夫多项式给出的显式参数化。我们直观地预期,这样的模型可以缓解在节点度分布非常宽的图(如社交网络、引文网络、知识图谱和许多其他真实世界的图数据集)上对局部邻域结构过拟合的问题。此外,在固定的计算开销下,这种逐层线性的形式允许我们构建更深的模型,而这种做法已知能在许多领域提升建模能力 (He et al., 2016)。

In this linear formulation of a GCN we further approximate $\lambda_{max} \approx 2$, as we can expect that neural network parameters will adapt to this change in scale during training. Under these approximations Eq. 5 simplifies to:

在 GCN 的这种线性形式中,我们进一步近似 $\lambda_{max} \approx 2$,因为可以预期神经网络参数在训练过程中会适应这种尺度上的变化。在这些近似下,式 (5) 简化为:

$$g_{\theta'} \star x \approx \theta'_0\, x + \theta'_1 (L - I_N)\, x = \theta'_0\, x - \theta'_1\, D^{-1/2} A D^{-1/2} x \tag{6}$$

with two free parameters $\theta'_0$ and $\theta'_1$. The filter parameters can be shared over the whole graph. Successive application of filters of this form then effectively convolve the $k^{th}$-order neighborhood of a node, where $k$ is the number of successive filtering operations or convolutional layers in the neural network model.

该式有两个自由参数 $\theta'_0$ 和 $\theta'_1$。滤波器参数可以在整个图上共享。连续应用这种形式的滤波器,就能有效地对节点的 $k$ 阶邻域进行卷积,其中 $k$ 是神经网络模型中连续滤波操作或卷积层的数目。

In practice, it can be beneficial to constrain the number of parameters further to address overfitting and to minimize the number of operations (such as matrix multiplications) per layer. This leaves us
with the following expression:

在实践中,进一步限制参数数量既可以缓解过拟合,也可以减少每层的运算量(例如矩阵乘法),因而是有益的。这样我们得到以下表达式:

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x \tag{7}$$

with a single parameter $\theta = \theta'_0 = -\theta'_1$. Note that $I_N + D^{-1/2} A D^{-1/2}$ now has eigenvalues in the range $[0, 2]$. Repeated application of this operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep neural network model. To alleviate this problem, we introduce the following renormalization trick: $I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

该式只有一个参数 $\theta = \theta'_0 = -\theta'_1$。注意 $I_N + D^{-1/2} A D^{-1/2}$ 的特征值范围是 $[0, 2]$。因此,在深度神经网络模型中反复应用该算子会导致数值不稳定以及梯度爆炸/消失。为了缓解这个问题,我们引入如下的重归一化技巧:$I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$,其中 $\tilde{A} = A + I_N$,$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$。
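重归一化技巧中的 $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ 只依赖于图结构,可以在预处理阶段用稀疏矩阵一次性算好。下面给出一个基于 scipy.sparse 的示意写法;函数名 `normalize_adj` 是说明用的假设,并非论文的官方代码。

```python
import numpy as np
import scipy.sparse as sp

def normalize_adj(A):
    """重归一化技巧:A → D̃^{-1/2} (A + I_N) D̃^{-1/2},返回稀疏矩阵。"""
    A = sp.coo_matrix(A)
    A_tilde = A + sp.eye(A.shape[0])                      # Ã = A + I_N
    d_tilde = np.asarray(A_tilde.sum(axis=1)).flatten()   # D̃_ii = Σ_j Ã_ij
    d_inv_sqrt = sp.diags(d_tilde ** -0.5)                # D̃^{-1/2}
    return (d_inv_sqrt @ A_tilde @ d_inv_sqrt).tocsr()

# 用法示例:稀疏表示下,存储和计算量都与边数 |E| 成线性关系
A = sp.csr_matrix(np.array([[0., 1., 0.],
                            [1., 0., 1.],
                            [0., 1., 0.]]))
A_hat = normalize_adj(A)
```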

We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e. a $C$-dimensional feature vector for every node) and $F$ filters or feature maps as follows:

我们可以把这个定义推广到具有 $C$ 个输入通道(即每个节点一个 $C$ 维特征向量)的信号 $X \in \mathbb{R}^{N \times C}$ 以及 $F$ 个滤波器(特征映射)上,如下:

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta \tag{8}$$

where $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters and $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. This filtering operation has complexity $\mathcal{O}(|\mathcal{E}| F C)$, as $\tilde{A} X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.

其中 $\Theta \in \mathbb{R}^{C \times F}$ 是滤波器参数矩阵,$Z \in \mathbb{R}^{N \times F}$ 是卷积后的信号矩阵。这个滤波操作的复杂度是 $\mathcal{O}(|\mathcal{E}| F C)$,因为 $\tilde{A} X$ 可以高效地实现为稀疏矩阵与稠密矩阵的乘积。
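在实现上,式 (8) 就是一次稀疏-稠密矩阵乘法再乘上参数矩阵。下面的小例子延续上一段示意代码中假设的 `A_hat`(即 `normalize_adj(A)` 的结果,3 个节点),仅用来说明形状与复杂度;各变量名同样只是假设。

```python
import numpy as np

N, C, F = 3, 4, 2                      # 节点数、输入通道数、滤波器(特征映射)数
X = np.random.randn(N, C)              # 信号 X ∈ R^{N×C}
Theta = np.random.randn(C, F) * 0.1    # 滤波器参数 Θ ∈ R^{C×F}

Z = A_hat @ X @ Theta                  # 式 (8):Z ∈ R^{N×F},复杂度 O(|E|·F·C)
print(Z.shape)                         # (3, 2)
```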

3. 半监督节点分类(SEMI-SUPERVISED NODE CLASSIFICATION)

Having introduced a simple, yet flexible model $f(X, A)$ for efficient information propagation on graphs, we can return to the problem of semi-supervised node classification. As outlined in the introduction, we can relax certain assumptions typically made in graph-based semi-supervised learning by conditioning our model $f(X, A)$ both on the data $X$ and on the adjacency matrix $A$ of the underlying graph structure. We expect this setting to be especially powerful in scenarios where the adjacency matrix contains information not present in the data $X$, such as citation links between documents in a citation network or relations in a knowledge graph. The overall model, a multi-layer GCN for semi-supervised learning, is schematically depicted in Figure 1.

在介绍了一个简单而灵活、可以在图上高效传播信息的模型 $f(X, A)$ 之后,我们回到半监督节点分类的问题上。如引言所述,通过让模型 $f(X, A)$ 同时以数据 $X$ 和底层图结构的邻接矩阵 $A$ 为条件,我们可以放松基于图的半监督学习中通常所做的某些假设。我们预期这种设定在邻接矩阵包含数据 $X$ 中没有的信息的场景下尤其有效,例如引文网络中文档之间的引用链接,或知识图谱中的关系。整体模型,即用于半监督学习的多层 GCN,如图 1 所示。

3.1 例子(EXAMPLE)

In the following, we consider a two-layer GCN for semi-supervised node classification on a graph with a symmetric adjacency matrix $A$ (binary or weighted). We first calculate $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ in a pre-processing step. Our forward model then takes the simple form:

接下来,我们考虑一个两层 GCN,用于在具有对称邻接矩阵 $A$(二值或加权)的图上做半监督节点分类。我们首先在预处理步骤中计算 $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$。这样,我们的前向模型就有如下简单形式:

$$Z = f(X, A) = \mathrm{softmax}\left( \hat{A}\, \mathrm{ReLU}\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) \tag{9}$$

Here, $W^{(0)} \in \mathbb{R}^{C \times H}$ is an input-to-hidden weight matrix for a hidden layer with $H$ feature maps. $W^{(1)} \in \mathbb{R}^{H \times F}$ is a hidden-to-output weight matrix. The softmax activation function, defined as $\mathrm{softmax}(x_i) = \frac{1}{\mathcal{Z}} \exp(x_i)$ with $\mathcal{Z} = \sum_i \exp(x_i)$, is applied row-wise. For semi-supervised multiclass classification, we then evaluate the cross-entropy error over all labeled examples:

这里,$W^{(0)} \in \mathbb{R}^{C \times H}$ 是输入层到隐藏层的权重矩阵,隐藏层有 $H$ 个特征映射;$W^{(1)} \in \mathbb{R}^{H \times F}$ 是隐藏层到输出层的权重矩阵。softmax 激活函数定义为 $\mathrm{softmax}(x_i) = \frac{1}{\mathcal{Z}} \exp(x_i)$,其中 $\mathcal{Z} = \sum_i \exp(x_i)$,并按行应用。对于半监督多类别分类,我们在所有带标签的样本上计算交叉熵误差:
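式 (9) 的前向计算可以写成几行矩阵运算。下面用 numpy 给出一个示意版本,$\hat{A}$ 假设已按前文预先算好并作为参数传入;`row_softmax`、`gcn_forward` 等名字都是说明用的假设。

```python
import numpy as np

def row_softmax(M):
    """按行应用 softmax:每一行先减去最大值以保证数值稳定。"""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """式 (9):Z = softmax(Â · ReLU(Â X W^{(0)}) · W^{(1)})。"""
    H = np.maximum(0, A_hat @ X @ W0)   # 隐藏层,ReLU 激活
    return row_softmax(A_hat @ H @ W1)  # 输出层,每行是一个节点的类别分布
```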

$$\mathcal{L} = - \sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf} \tag{10}$$

where $\mathcal{Y}_L$ is the set of node indices that have labels.
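式 (10) 只在带标签的节点上累加交叉熵。下面是一个简短的 numpy 示意,其中 `Z`、`Y_onehot`、`labeled_idx` 等名字都是说明用的假设。

```python
import numpy as np

def masked_cross_entropy(Z, Y_onehot, labeled_idx):
    """式 (10):L = −Σ_{l∈Y_L} Σ_f Y_lf · ln Z_lf,只对带标签的节点求和。"""
    eps = 1e-12                       # 防止 log(0)
    Z_l = Z[labeled_idx]              # 带标签节点的预测分布
    Y_l = Y_onehot[labeled_idx]       # 对应的 one-hot 标签
    return -np.sum(Y_l * np.log(Z_l + eps))
```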

The neural network weights $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent. In this work, we perform batch gradient descent using the full dataset for every training iteration, which is a viable option as long as datasets fit in memory. Using a sparse representation for $A$, memory requirement is $\mathcal{O}(|\mathcal{E}|)$, i.e. linear in the number of edges. Stochasticity in the training process is introduced via dropout (Srivastava et al., 2014). We leave memory-efficient extensions with mini-batch stochastic gradient descent for future work.
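下面给出一个示意性的全批量训练循环(这里用 PyTorch 编写,只为说明思路,并非论文的原始实现),展示上文所说的"每次迭代使用整个数据集"以及"通过 dropout 引入随机性";所有数据和变量名都是随机构造的假设。

```python
import torch
import torch.nn.functional as F

# 随机构造一个小例子:N 个节点、C 维特征、K 个类别(仅作演示)
N, C, H, K = 6, 8, 4, 3
A_hat = torch.eye(N)                                   # 这里用单位阵代替预先算好的 Â(假设)
X = torch.randn(N, C)
y = torch.randint(0, K, (N,))
train_mask = torch.tensor([True, True, False, False, False, False])  # 只有少数节点有标签

W0 = torch.nn.Parameter(torch.randn(C, H) * 0.1)       # W^{(0)}
W1 = torch.nn.Parameter(torch.randn(H, K) * 0.1)       # W^{(1)}
optimizer = torch.optim.Adam([W0, W1], lr=0.01)

for epoch in range(200):                               # 每次迭代使用整个数据集(全批量)
    optimizer.zero_grad()
    hidden = F.relu(A_hat @ X @ W0)
    hidden = F.dropout(hidden, p=0.5, training=True)   # 训练过程中的随机性来自 dropout
    logits = A_hat @ hidden @ W1
    loss = F.cross_entropy(logits[train_mask], y[train_mask])  # 只在带标签节点上计算监督损失
    loss.backward()
    optimizer.step()
```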

