Datawhale Graph Neural Networks Task03: Node Representation Learning with Graph Neural Networks

Course: gitee_Datawhale_GNN
Forum: Datawhale CLUB
WeChat official account: Datawhale

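This installment on graph neural networks finally gets into code; things become hands-on and a lot less abstract.
Below is a comparison of three different network models trained on two datasets that ship with torch_geometric.
The first model, an MLP, is a fairly traditional neural network: linear transformation, relu, dropout. The finest dishes often need only the plainest cooking.
The second and third models are GCN and GAT. Their main difference from the MLP is that they bring in the edge information.
The three models also differ very little in code; the code looks very similar, almost identical. Thanks go to Python and PyTorch for keeping the code this concise.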
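To make "bringing in the edge information" concrete, here is a minimal pure-Python sketch (no PyTorch), assuming a toy 3-node path graph and leaving out the learned weight matrix. It applies the symmetric normalization with self-loops that a GCN layer uses, D^-1/2 (A + I) D^-1/2 X. The function name `gcn_propagate` and the toy graph are illustrative assumptions, not part of the course code; a GAT layer would replace the fixed coefficients with learned attention weights.

```python
import math

def gcn_propagate(features, edges):
    """One weight-free GCN propagation step: D^-1/2 (A + I) D^-1/2 X."""
    n = len(features)
    # Build neighbor sets for an undirected graph; {i} adds the self-loop.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]  # degree including the self-loop
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += coeff * features[j][k]
        out.append(row)
    return out

# Toy path graph 0 - 1 - 2 with one-hot node features.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = [(0, 1), (1, 2)]
mixed = gcn_propagate(features, edges)
for row in mixed:
    print([round(value, 3) for value in row])
```

An MLP processes each row of `features` on its own. After this step, node 0's row also carries part of node 1's feature, which is the extra signal GCN and GAT get from `edge_index`.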
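So let the pictures speak first and look at how the three models differ.
On the "Cora" dataset (Cora consists of machine learning papers and has been a favorite benchmark in graph deep learning in recent years; the papers fall into seven classes), the graph neural network models beat the MLP outright, as if the dataset were tailor-made for graph neural networks. In the results, the graph neural networks separate the classes clearly and intuitively, while the MLP looks as if it was forced to hand in its homework before figuring out what it was supposed to do (much like me cramming to check in at the last minute). The MLP's final score is just as true to life: 59.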
[Figure: results on "Cora"]
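Next is the "CiteSeer" dataset. Like PubMed and the Cora dataset above, it is a citation dataset (talk about competitive: the best-known benchmark data for graph convolutional networks are themselves research papers). What they have in common is that the labeled subsets used for training, validation and testing are small, fixed samples of the graph, each hoping to glimpse the whole leopard through a tube. What differs is that the datasets get progressively larger.
[Figure: the three major citation datasets in torch_geometric.datasets.Planetoid]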
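As a rough sense of scale, the sketch below works out how small the labeled training set is for each dataset. The numbers are the commonly cited statistics of the public Planetoid split (20 training nodes per class), hard-coded here as an assumption rather than read from `torch_geometric`; `PLANETOID_STATS` and `labeled_fraction` are illustrative names.

```python
# Commonly cited statistics of the three Planetoid citation datasets.
# Assumption: the public split with 20 training nodes per class.
PLANETOID_STATS = {
    "Cora":     {"nodes": 2708,  "features": 1433, "classes": 7},
    "CiteSeer": {"nodes": 3327,  "features": 3703, "classes": 6},
    "PubMed":   {"nodes": 19717, "features": 500,  "classes": 3},
}

def labeled_fraction(stats, per_class=20):
    """Share of nodes whose labels are seen during training."""
    return stats["classes"] * per_class / stats["nodes"]

for name, stats in PLANETOID_STATS.items():
    train_nodes = stats["classes"] * 20
    print(f"{name}: {train_nodes} training nodes, "
          f"{labeled_fraction(stats):.2%} of {stats['nodes']} nodes")
```

With so few labeled nodes, a model that can borrow information from neighbors has an obvious advantage over one that sees each node in isolation.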
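Once the dataset gets bigger, the MLP drops from 59 to 58, but the two graph neural networks lose even more points. I don't fully understand why, but it is striking.
[Figure: results on "CiteSeer"]
Respecting the spirit of those who built the wheels, I write as few lines of code as I can. I put the three networks (MLP, GCN and GAT) and the three datasets (Cora, CiteSeer and PubMed) into for loops and adapted the code as follows; just copy it and run. With all three datasets enabled (switch to the commented-out loop line to include PubMed), there are 3*3=9 test results in the end.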

import torch
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures
# Helper for visualizing the distribution of node representations
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
def visualize(h, color, title=None):
    # Project the representations passed in as h (not a global) to 2D with t-SNE.
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])
    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    plt.title(title)
    plt.show()


# Iterate over the models and datasets and collect the training results
for Planetoid_dataset in ["Cora", "CiteSeer"]:
# for Planetoid_dataset in ["Cora", "CiteSeer", "PubMed"]:
    for gnn_model in ['MLP','GCN','GAT']:

        # Load the dataset
        dataset = Planetoid(root='data/Planetoid', name=Planetoid_dataset, transform=NormalizeFeatures())
        data = dataset[0]  # Get the first graph object.

        # Define the network model
        class GNN(torch.nn.Module):
            def __init__(self, hidden_channels):
                super(GNN, self).__init__()
                torch.manual_seed(12345)
                if gnn_model=='MLP':
                    self.lin1 = Linear(dataset.num_features, hidden_channels)
                    self.lin2 = Linear(hidden_channels, dataset.num_classes)
                elif gnn_model=='GCN':
                    self.conv1 = GCNConv(dataset.num_features, hidden_channels)
                    self.conv2 = GCNConv(hidden_channels, dataset.num_classes)
                elif gnn_model=='GAT':
                    self.conv1 = GATConv(dataset.num_features, hidden_channels)
                    self.conv2 = GATConv(hidden_channels, dataset.num_classes) 
            if gnn_model=='MLP':
                def forward(self, x):
                    x = self.lin1(x)
                    x = x.relu()
                    x = F.dropout(x, p=0.5, training=self.training)
                    x = self.lin2(x)
                    return x
            elif gnn_model in ('GCN', 'GAT'):
                def forward(self, x, edge_index):
                    x = self.conv1(x, edge_index)
                    x = x.relu()
                    x = F.dropout(x, p=0.5, training=self.training)
                    x = self.conv2(x, edge_index)
                    return x

        model = GNN(hidden_channels=16)

        # Visualize the node representations produced by the untrained model
        model.eval()
        if gnn_model=='MLP':
            out = model(data.x)
        elif gnn_model in ('GCN', 'GAT'):
            out = model(data.x, data.edge_index)
        visualize(out, color=data.y,title='before training:'+gnn_model+'_'+Planetoid_dataset)

        # Train the node classifier (MLP, GCN or GAT)
        criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # Define optimizer.
        def train(gnn_model=gnn_model):
            model.train()
            optimizer.zero_grad()  # Clear gradients.
            if gnn_model=='MLP':
                out = model(data.x)  # Perform a single forward pass.
            elif gnn_model in ('GCN', 'GAT'):
                out = model(data.x, data.edge_index)  # Perform a single forward pass.
            loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
            loss.backward()  # Derive gradients.
            optimizer.step()  # Update parameters based on gradients.
            return loss
        loss_list = []
        for epoch in range(1, 201):
            loss = train(gnn_model)
            loss_list.append(loss.item())  # Store a plain float for plotting.
        plt.figure(figsize=(9,5))
        plt.plot(loss_list)
        plt.title('training:'+gnn_model+'_'+Planetoid_dataset+f' Epoch: {epoch:03d}, Loss: {loss:.4f}')
        plt.show()

        # Visualize the node representations produced by the trained model
        model.eval()
        if gnn_model=='MLP':
            out = model(data.x)
        elif gnn_model in ('GCN', 'GAT'):
            out = model(data.x, data.edge_index)
        visualize(out, color=data.y,title='after training:'+gnn_model+'_'+Planetoid_dataset)

        # Measure the test accuracy of the node classifier
        def test(gnn_model=gnn_model):
            model.eval()
            if gnn_model=='MLP':
                out = model(data.x)
            elif gnn_model in ('GCN', 'GAT'):
                out = model(data.x, data.edge_index)
            pred = out.argmax(dim=1)  # Use the class with highest probability.
            test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
            test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
            return test_acc
        test_acc = test()
        print(gnn_model+'_'+Planetoid_dataset+f' Test Accuracy: {test_acc:.4f}')
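
The listing repeats the same model dispatch before plotting, in training and in testing. One possible tidy-up, shown here only as a sketch, is a single helper; `run_model` is a hypothetical name, and the stand-in model and data objects below exist only so the snippet runs without PyTorch. It uses a membership test because an expression of the form `name == 'GCN' or 'GAT'` is always truthy in Python.

```python
from types import SimpleNamespace

def run_model(model, data, gnn_model):
    """Call the model with the inputs its architecture expects."""
    if gnn_model == 'MLP':
        return model(data.x)                    # node features only
    if gnn_model in ('GCN', 'GAT'):
        return model(data.x, data.edge_index)   # node features plus edges
    raise ValueError(f'unknown model type: {gnn_model}')

# Stand-ins for a real model and a real PyG data object, for illustration only.
data = SimpleNamespace(x='x', edge_index='edge_index')
mlp_like = lambda x: ('mlp', x)
gnn_like = lambda x, edge_index: ('gnn', x, edge_index)

print(run_model(mlp_like, data, 'MLP'))
print(run_model(gnn_like, data, 'GAT'))

# Why the membership test matters: this comparison is truthy for any name.
print(bool('MLP' == 'GCN' or 'GAT'))
```

In the listing, each `if gnn_model=='MLP' ... elif ...` pair could then shrink to `out = run_model(model, data, gnn_model)`.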

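For the last dataset, PubMed, which has the most nodes, something magical happens: the three methods end up with similar results, all a bit over seventy percent.
It truly bears out the old saying: there is no free lunch!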
0.7390 MLP
0.7860 GCN
0.7540 GAT
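So what is the reason? I don't know, and I don't dare to make claims.
Thanks to a group member for recommending the paper Pitfalls of Graph Neural Network Evaluation, which exposes this problem and resolved my doubts. Learning together really is great~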
Semi-supervised node classification in graphs is a fundamental problem in graph
mining, and the recently proposed graph neural networks (GNNs) have achieved
unparalleled results on this task. Due to their massive success, GNNs have attracted
a lot of attention, and many novel architectures have been put forward. In this
paper we show that existing evaluation strategies for GNN models have serious
shortcomings. We show that using the same train/validation/test splits of the
same datasets, as well as making significant changes to the training procedure
(e.g. early stopping criteria) precludes a fair comparison of different architectures.
We perform a thorough empirical evaluation of four prominent GNN models and
show that considering different splits of the data leads to dramatically different
rankings of models. Even more importantly, our findings suggest that simpler
GNN architectures are able to outperform the more sophisticated ones if the
hyperparameters and the training procedure are tuned fairly for all models.

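Key takeaway: the existing evaluation strategies for GNN models have serious shortcomings. A more complex model is not necessarily better; a simple model that is tuned well can actually perform better.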
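
The abstract argues against always reusing one fixed split. A minimal pure-Python sketch of drawing a fresh random split per seed is given below. `random_split` is a hypothetical helper, the per-class sizes are illustrative defaults, and the labels are made up; it is not the paper's code.

```python
import random

def random_split(labels, per_class_train=20, per_class_val=30, seed=0):
    """Sample train/val indices per class; all remaining nodes form the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        train += members[:per_class_train]
        val += members[per_class_train:per_class_train + per_class_val]
    chosen = set(train) | set(val)
    test = [i for i in range(len(labels)) if i not in chosen]
    return train, val, test

# Hypothetical labels: 3 classes with 100 nodes each.
labels = [i % 3 for i in range(300)]
train, val, test = random_split(labels, seed=1)
print(len(train), len(val), len(test))  # 60 90 150
```

The returned index lists could be turned into boolean masks and used in place of `data.train_mask` and `data.test_mask`, so that each model is trained and scored on several different splits.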

Mean test set accuracy and standard deviation in percent averaged over 100 random train/validation/test splits with 20 random weight initializations each for all models and all datasets
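Reporting a mean and a standard deviation over many runs, as the caption above describes, can be sketched as follows. The accuracies are hypothetical placeholders for illustration, not results from this post or from the paper, and `summarize` is an illustrative name.

```python
import statistics

def summarize(accuracies):
    """Return mean and sample standard deviation, both in percent."""
    mean = statistics.mean(accuracies) * 100
    std = statistics.stdev(accuracies) * 100 if len(accuracies) > 1 else 0.0
    return mean, std

# Hypothetical accuracies from five runs, for illustration only.
runs = [0.78, 0.80, 0.79, 0.81, 0.77]
mean, std = summarize(runs)
print(f"{mean:.1f} ± {std:.1f}")
```

A single number such as 0.7860 says little about whether one model really beats another; the spread across splits and seeds is what makes the comparison meaningful.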
Regrettably, graph neural networks are not perfect;
fortunately, graph neural networks are not perfect.
