Week 4. Live Session 03. Guided Paper Reading + GAT


This post is compiled from the 深度之眼 course "GNN核心能力培养计划" (GNN Core Competence Training Program).

Paper 1: Quick Read

Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning
The main takeaways from this paper are why GCNs are usually not very deep, and what the drawbacks of GCNs are.

Abstract: structure analysis

Many interesting problems in machine learning are being revisited with new deep learning tools. For graph-based semisupervised learning, a recent important development is graph convolutional networks (GCNs), which nicely integrate local vertex features and graph topology in the convolutional layers.
Transition from machine learning in general to GCNs, noting GCN's key property: graph convolution integrates node features (embeddings) and graph topology.

Although the GCN model compares favorably with other state-of-the-art methods, its mechanisms are not clear and it still requires considerable amount of labeled data for validation and model selection.
A turn: the current problems with GCNs are pointed out: their mechanism is not well understood, and they still need a considerable amount of labeled data for validation and model selection.

In this paper, we develop deeper insights into the GCN model and address its fundamental limits.
Transition sentence; the paper's work is then introduced in two points.

First, we show that the graph convolution of the GCN model is actually a special form of Laplacian smoothing, which is the key reason why GCNs work, but it also brings potential concerns of oversmoothing with many convolutional layers.
First point: graph convolution is shown to be a special form of Laplacian smoothing. This is the key reason GCNs work, but it also means that stacking many convolutional layers leads to over-smoothing (i.e., the root cause of the problem is identified).

Second, to overcome the limits of the GCN model with shallow architectures, we propose both co-training and self-training approaches to train GCNs.
Second point: two training strategies, co-training and self-training, are proposed to overcome the limits of shallow GCNs.

Our approaches significantly improve GCNs in learning with very few labels, and exempt them from requiring additional labels for validation. Extensive experiments on benchmarks have verified our theory and proposals.
The advantages and novelty of the proposed methods.

Introduction

The introduction opens with background, then prior work on unsupervised and semi-supervised learning, then work related to GCNs. Next comes the paper's line of reasoning:
In this paper, we demystify the GCN model for semisupervised learning.
Overview: what the authors did.

In particular, we show that the graph convolution of the GCN model is simply a special form of Laplacian smoothing, which mixes the features of a vertex and its nearby neighbors.
Details: what exactly was done, "we show ..."

The smoothing operation makes the features of vertices in the same cluster similar, thus greatly easing the classification task, which is the key reason why GCNs work so well.
Building on the result above, the paper explains why GCNs work so well: the smoothing operation (graph convolution) makes nodes in the same cluster end up with similar embeddings.

However, it also brings potential concerns of over-smoothing.
The problem is raised.

If a GCN is deep with many convolutional layers, the output features may be oversmoothed and vertices from different clusters may become indistinguishable.
The problem in detail: too many convolutional layers over-smooth the features, so node embeddings from different clusters become hard to distinguish.

Also, adding more layers to a GCN will make it much more difficult to train.
A final jab: more layers also make the model much harder to train.

However, a shallow GCN model such as the two-layer GCN used in (Kipf and Welling 2017) has its own limits.
Another turn: shallow GCNs have their own limits too.

Besides that it requires many additional labels for validation, it also suffers from the localized nature of the convolutional filter. When only few labels are given, a shallow GCN cannot effectively propagate the labels to the entire data graph.
First, they require many additional labels for validation; second, because the convolutional filter is localized, a few labels cannot propagate their information across the whole graph.

As illustrated in Fig. 1, the performance of GCNs drops quickly as the training size shrinks, even for the one with 500 additional labels for validation.
With a large enough training set the accuracy holds up, but as the training set shrinks, the GCN without a validation set drops sharply in accuracy (the green line below).
[Figure 1 of the paper: GCN classification accuracy as the training set size shrinks]
The concrete remedies are described later in the paper.

Preliminaries and Related Works

Section 2, Preliminaries and Related Works, reviews graph-based semi-supervised learning and then focuses on GCNs.
The propagation rule is given in Equation 4:
$$H^{(l+1)}=\sigma\left(\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}H^{(l)}\Theta^{(l)}\right)\tag{4}$$
from which we can see that $\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}$ is exactly a Laplacian smoothing term.
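To make the smoothing term concrete, here is a minimal sketch (my own illustration, not code from the paper) that builds $\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}$ from a toy adjacency matrix and applies one propagation step; the names adj, H and Theta are hypothetical.

import torch

def gcn_layer(adj, H, Theta):
    """One GCN propagation step: sigma(D~^-1/2 A~ D~^-1/2 H Theta)."""
    A_tilde = adj + torch.eye(adj.size(0))      # add self-loops
    d_tilde = A_tilde.sum(dim=1)                # degrees of A~
    D_inv_sqrt = torch.diag(d_tilde.pow(-0.5))  # D~^{-1/2}
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt       # the Laplacian smoothing term
    return torch.relu(S @ H @ Theta)

# toy example: 4 nodes on a path graph, 3-dim input features, 2-dim output
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
print(gcn_layer(adj, torch.randn(4, 3), torch.randn(3, 2)).shape)  # torch.Size([4, 2])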
Semi-supervised classification with the GCN is then introduced; see Equation 6 of the paper:
$$\mathcal{L}:=-\sum_{i\in V_l}\sum_{f=1}^F Y_{if}\ln Z_{if}\tag{6}$$

Here $i\in V_l$ restricts the loss to labeled nodes (hence semi-supervised); the summand is a cross-entropy, where $Y_{if}$ is the ground truth, $Z_{if}$ is the prediction, and $F$ is the output dimension, equal to the number of classes.

Analysis

Here is an interesting part: to show what the Laplacian smoothing term contributes, the authors drop it from Equation 4 (an ablation-style experiment), which reduces the model to a plain fully-connected network (FCN):

$$H^{(l+1)}=\sigma\left(H^{(l)}\Theta^{(l)}\right)\tag{7}$$
In other words, each node's features are transformed independently and no neighbor information is aggregated. The results collapse:
[Table 1 of the paper: classification accuracy of GCNs vs. FCNs]
This shows that the Laplacian smoothing term is crucial.
Then, by comparing the standard Laplacian smoothing formula (Equation 9 in the paper) with the GCN formula, a conclusion is drawn:
The Laplacian smoothing computes the new features of a vertex as the weighted average of itself and its neighbors’. Since vertices in the same cluster tend to be densely connected, the smoothing makes their features similar, which makes the subsequent classification task much easier. As we can see from Table 1, applying the smoothing only once has already led to a huge performance gain.

Next, why a 2-layer GCN beats a 1-layer one (two layers smooth more than one):
Multi-layer Structure. We can also see from Table 1 that while the 2-layer FCN only slightly improves over the 1-layer FCN, the 2-layer GCN significantly improves over the 1-layer GCN by a large margin. This is because applying smoothing again on the activations of the first layer makes the output features of vertices in the same cluster more similar and further eases the classification task.

Then: is deeper always better?
A natural question is how many convolutional layers should be included in a GCN?
Certainly not the more the better. On the one hand, a GCN with many layers is difficult to train. On the other hand, repeatedly applying Laplacian smoothing may mix the features of vertices from different clusters and make them indistinguishable.
The authors run an experiment on this:
[Figure from the paper: classification results with GCNs of varying depth]
By inspection, two layers give the best classification results.
After the experiments, the authors also back it up with a theoretical proof:
In the following, we will prove that by repeatedly applying Laplacian smoothing many times, the features of vertices within each connected component of the graph will converge to the same values.
In the screenshot below, the red-underlined part corresponds to the Laplacian smoothing term, $m$ is the number of times smoothing is repeated, and $w$ is a parameter; the right-hand side shows that all node features end up the same.
[Screenshot of the paper's theorem: repeated Laplacian smoothing converges within each connected component]
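A quick way to see the over-smoothing effect numerically (my own sketch on a hypothetical toy graph, not the paper's proof): repeatedly apply the smoothing operator to random features and watch the node vectors become indistinguishable. Since the theorem says the features converge to values proportional to $\sqrt{d_i+1}$ within a connected component, the pairwise cosine similarities between nodes approach 1.

import torch
import torch.nn.functional as F

# a small connected toy graph (hypothetical adjacency, for illustration only)
adj = torch.tensor([[0., 1., 1., 0., 0.],
                    [1., 0., 1., 0., 0.],
                    [1., 1., 0., 1., 0.],
                    [0., 0., 1., 0., 1.],
                    [0., 0., 0., 1., 0.]])
A_tilde = adj + torch.eye(5)
D_inv_sqrt = torch.diag(A_tilde.sum(1).pow(-0.5))
S = D_inv_sqrt @ A_tilde @ D_inv_sqrt            # the smoothing operator

H = torch.randn(5, 3)                            # random node features
for m in (1, 5, 50):
    Hm = torch.matrix_power(S, m) @ H            # smooth m times
    Hn = F.normalize(Hm, dim=1)                  # unit-norm rows
    print(m, (Hn @ Hn.t()).min().item())         # smallest pairwise cosine similarity -> 1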
Besides the proof, the authors also add this remark:
Since label propagation only uses the graph information while GCNs utilize both structural and vertex features, it reflects the inability of the GCN model in exploring the global graph structure.
Label propagation (the blue line in Figure 1) uses only the graph information, while the GCN uses both structural and node features; that the GCN still loses to label propagation with very few labels reflects its weak ability to exploit the global graph structure, since graph convolution mainly captures local information.

Solutions

Advantages and disadvantages:
The advantages are: 1) the graph convolution – Laplacian smoothing helps making the classification problem much easier; 2) the multi-layer neural network is a powerful feature extractor.
The disadvantages are: 1) the graph convolution is a localized filter, which performs unsatisfactorily with few labeled data; 2) the neural network needs considerable amount of labeled data for validation and model selection. (When labels are scarce, some parts of the graph never receive label information from their neighbors and thus contribute no training signal.)

Paper 2: Quick Read

DeepGCNs: Can GCNs Go as Deep as CNNs?
This paper borrows the ResNet idea from computer vision and introduces residual connections into GCNs so that GCNs can go much deeper.

Abstract

It first discusses CNNs and notes that CNNs can go deep.
Convolutional Neural Networks (CNNs) achieve impressive performance in a wide variety of fields. Their success benefited from a massive boost when very deep CNN models were able to be reliably trained.

CNNs do not handle non-Euclidean data well, which motivates GCNs; the paper also notes that tricks from CNNs can be carried over to GCNs.
Despite their merits, CNNs fail to properly address problems with non-Euclidean data. To overcome this challenge, Graph Convolutional Networks (GCNs) build graphs to represent non-Euclidean data, borrow concepts from CNNs, and apply them in training.

A turn: the problem with current GCNs is raised.

The left panel of Figure 1 in the paper shows that, without residual connections, performance gets worse as the GCN gets deeper (vanishing gradients); the right panel shows that with residual connections added, deep GCNs train well.
[Figure 1 of the paper: training curves of plain vs. residual GCNs at different depths]
GCNs show promising results, but they are usually limited to very shallow models due to the vanishing gradient problem (see Figure 1).

Hence, without residual connections, state-of-the-art GCNs are usually only 3-4 layers deep.

To address this, the authors bring residual/dense connections and dilated convolutions from CNNs into GCNs, solving the problem that GCNs cannot go deep.
In this work, we present new ways to successfully train very deep GCNs. We do this by borrowing concepts from CNNs, specifically residual/dense connections and dilated convolutions, and adapting them to GCN architectures.

Finally, the authors hype up their own idea:
Extensive experiments show the positive effect of these deep GCN frameworks. Finally, we use these new concepts to build a very deep 56-layer GCN, and show how it significantly boosts performance (+3.7% mIoU over state-of-the-art) in the task of point cloud semantic segmentation. We believe that the community can greatly benefit from this work, as it opens up many opportunities for advancing GCN-based research.

3.2. Residual Learning for GCNs

For an introduction to ResNet and DenseNet, see zhuanlan.zhihu.com/p/37189203. What does a residual connection look like in a GCN? See Equation 3 of the paper:
$$\begin{aligned}\mathcal{G}_{l+1}&=\mathcal{H}(\mathcal{G}_{l},\mathcal{W}_{l})\\ &=\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l})+\mathcal{G}_{l}=\mathcal{G}_{l+1}^{res}+\mathcal{G}_{l}\end{aligned}\tag{3}$$
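A minimal sketch of what Equation 3 looks like in code (my own illustration, not the authors' implementation): gconv stands in for the generic graph convolution $\mathcal{F}$, here DGL's GraphConv, and the residual simply adds the input node features back, assuming the input and output dimensions match.

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import GraphConv

class ResGCNBlock(nn.Module):
    """G_{l+1} = F(G_l, W_l) + G_l (Eq. 3); assumes in_dim == out_dim."""
    def __init__(self, dim):
        super(ResGCNBlock, self).__init__()
        self.gconv = GraphConv(dim, dim, activation=F.relu)

    def forward(self, g, h):
        # vertex-wise residual connection over the graph convolution
        return self.gconv(g, h) + h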

3.3. Dense Connections in GCNs

Following the DenseNet approach instead:
$$\begin{aligned}\mathcal{G}_{l+1}&=\mathcal{H}(\mathcal{G}_{l},\mathcal{W}_{l})\\ &=\mathcal{T}(\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l}),\mathcal{G}_{l})\\ &=\mathcal{T}(\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l}),\cdots,\mathcal{F}(\mathcal{G}_{0},\mathcal{W}_{0}),\mathcal{G}_{0}) \end{aligned}\tag{4}$$
From the last line we can see that layer $l+1$ is densely connected to every preceding layer, from layer $l$ all the way down to layer 0; see the sketch below.
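A dense block can be sketched along the same lines (again my own illustration): here $\mathcal{T}$ is taken to be feature concatenation, so the feature dimension grows by growth_dim per block and each layer sees the outputs of all the layers before it.

import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import GraphConv

class DenseGCNBlock(nn.Module):
    """G_{l+1} = T(F(G_l, W_l), G_l): concatenate new features with the input (Eq. 4)."""
    def __init__(self, in_dim, growth_dim):
        super(DenseGCNBlock, self).__init__()
        self.gconv = GraphConv(in_dim, growth_dim, activation=F.relu)

    def forward(self, g, h):
        # output dim = in_dim + growth_dim; stacking such blocks means layer l+1
        # receives the features of layers l, l-1, ..., 0
        return torch.cat([self.gconv(g, h), h], dim=1)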

GAT by DGL

官网:https://docs.dgl.ai/tutorials/models/1_gnn/9_gat.html
$$\begin{aligned}
z_i^{(l)}&=W^{(l)}h_i^{(l)}, &(1)\\
e_{ij}^{(l)}&=\text{LeakyReLU}\left(\vec a^{(l)^T}\left(z_i^{(l)}\|z_j^{(l)}\right)\right), &(2)\\
\alpha_{ij}^{(l)}&=\frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k\in\mathcal{N}(i)}\exp\left(e_{ik}^{(l)}\right)}, &(3)\\
h_i^{(l+1)}&=\sigma\left(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{(l)}z_j^{(l)}\right), &(4)
\end{aligned}$$
Equations 1-3 are illustrated by the figure below:
[Figure from the DGL tutorial: computing the attention score between a pair of nodes]

Imports

from dgl.nn.pytorch import GATConv  # DGL's built-in GAT layer; the hand-written version below implements the same thing

import torch
import torch.nn as nn
import torch.nn.functional as F

Define the GATLayer

This follows the standard DGL message-passing pattern; see the comments below for the details:


class GATLayer(nn.Module):
    # constructor
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1)
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2): the input dimension is 2*out_dim because z_i and z_j are concatenated
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)

    def edge_attention(self, edges):
        # edge UDF for equation (2)
        # in equation (2) the attention between two nodes acts as an edge weight, so we define an edge UDF here
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
        a = self.attn_fc(z2)
        return {'e': F.leaky_relu(a)}

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['z'], 'e': edges.data['e']}

    def reduce_func(self, nodes):
        # reduce UDF for equation (3) & (4)
        # equation (3)
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equation (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')

Define multi-head attention

The outputs of the multiple attention heads can be either concatenated or averaged.

class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):  # one GATLayer per attention head
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge  # merge mode: 'cat' or 'mean'

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        if self.merge == 'cat':  # option 1: concatenate
            # concat on the output feature dimension (dim=1)
            return torch.cat(head_outs, dim=1)
        else:
            # option 2: average over the heads (mean along the stacked head dimension)
            return torch.mean(torch.stack(head_outs), dim=0)

Define the model

The GAT is built from two multi-head attention layers.

class GAT(nn.Module):
    def __init__(self, g, in_dim, hidden_dim, out_dim, num_heads):
        super(GAT, self).__init__()
        self.layer1 = MultiHeadGATLayer(g, in_dim, hidden_dim, num_heads)
        # Be aware that the input dimension is hidden_dim*num_heads since
        # multiple head outputs are concatenated together. Also, only
        # one attention head in the output layer.
        self.layer2 = MultiHeadGATLayer(g, hidden_dim * num_heads, out_dim, 1)

    def forward(self, h):
        h = self.layer1(h)
        # unlike ReLU, ELU can take negative values, which pushes the mean activation closer to zero
        h = F.elu(h)
        h = self.layer2(h)
        return h
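For comparison, the GATConv imported at the top is DGL's built-in implementation of equations (1)-(4). A rough sketch of an equivalent two-layer model follows (reusing the imports above); note that the built-in layer takes the graph as the first argument of forward and returns a tensor of shape (N, num_heads, out_dim), and the exact constructor arguments may differ slightly between DGL versions.

class BuiltinGAT(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_heads):
        super(BuiltinGAT, self).__init__()
        self.layer1 = GATConv(in_dim, hidden_dim, num_heads)
        self.layer2 = GATConv(hidden_dim * num_heads, out_dim, 1)

    def forward(self, g, h):
        h = self.layer1(g, h).flatten(1)  # (N, heads, hidden) -> concatenate the heads
        h = F.elu(h)
        h = self.layer2(g, h).mean(1)     # single output head -> (N, out_dim)
        return h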

Load the dataset

The Cora dataset is used.

from dgl import DGLGraph
from dgl.data import citation_graph as citegrh
import networkx as nx

def load_cora_data():
    data = citegrh.load_cora()
    features = torch.FloatTensor(data.features)  # node features
    labels = torch.LongTensor(data.labels)  # node labels
    mask = torch.BoolTensor(data.train_mask)  # boolean mask selecting the training nodes
    g = DGLGraph(data.graph)
    return g, features, labels, mask

Training

import time
import numpy as np

g, features, labels, mask = load_cora_data()

# create the model, 2 heads, each head has hidden size 8
net = GAT(g,
          in_dim=features.size()[1],
          hidden_dim=8,
          out_dim=7,  # 7 classes, so the output dimension is 7
          num_heads=2)

# create optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# main loop
dur = []
for epoch in range(30):
    if epoch >= 3:
        t0 = time.time()

    # forward pass; the NLL of the log-softmax is the cross-entropy
    logits = net(features)
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[mask], labels[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch >= 3:
        dur.append(time.time() - t0)

    print("Epoch {:05d} | Loss {:.4f} | Time(s) {:.4f}".format(
        epoch, loss.item(), np.mean(dur)))
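The tutorial then looks at test-set performance. A minimal evaluation sketch (my own addition, assuming load_cora_data() is extended to also return test_mask = torch.BoolTensor(data.test_mask)):

# hypothetical evaluation snippet; `test_mask` is assumed to be available
net.eval()
with torch.no_grad():
    logits = net(features)
    pred = logits.argmax(dim=1)
    acc = (pred[test_mask] == labels[test_mask]).float().mean().item()
print("Test accuracy: {:.4f}".format(acc))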

Analysis

Finally, judging from the visualization of the learned attention between nodes and from the test-set results, GAT and GCN perform about the same on Cora, because in Cora a node's neighbors do not carry particularly distinctive attention relationships. On the PPI dataset, where the interactions between proteins clearly do, GAT outperforms GCN by a large margin; see the official tutorial link for details.
