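I found out that PyG already ships ready-made modules for data loading and preprocessing. It feels like all the work I did earlier to process Cora, Citeseer, and Pubmed by hand was wasted. So from now on I'm standing on the shoulders of giants 😂. PyG really is great!
Reference: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html
Required third-party libraries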
- torch
- torch_geometric
My code: https://github.com/ytchx1999/GNN-Dataset/blob/main/Citation.ipynb
from torch_geometric.datasets import Planetoid
import torch
1. Processing the Cora dataset
1.1 Downloading the dataset
# Download the dataset and save the preprocessed version under root
dataset_cora = Planetoid(root='./cora/', name='Cora')
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!
# Print the dataset
print(dataset_cora)
Cora()
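The download log above lists eight raw files that share one naming pattern. As a minimal sketch, the snippet below rebuilds those URLs from the dataset name. The base URL and the suffixes are copied from the log; `planetoid_raw_urls` is a hypothetical helper written for illustration and is not part of PyG. I am assuming the same pattern applies to Citeseer and Pubmed.

```python
# Sketch: rebuild the raw-file URLs seen in the download log.
# BASE_URL and SUFFIXES are copied from the log output above.
# planetoid_raw_urls is a hypothetical helper, not a PyG function.
BASE_URL = "https://github.com/kimiyoung/planetoid/raw/master/data"
SUFFIXES = ["x", "tx", "allx", "y", "ty", "ally", "graph", "test.index"]


def planetoid_raw_urls(name):
    """Return the raw-file URLs for a dataset name such as 'Cora'."""
    prefix = f"ind.{name.lower()}"
    return [f"{BASE_URL}/{prefix}.{suffix}" for suffix in SUFFIXES]


for url in planetoid_raw_urls("Cora"):
    print(url)
```

In practice there is no need to fetch these files yourself: `Planetoid(root=..., name=...)` downloads and processes them on first use, which is what the log shows.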
1.2 Method 1: extracting data from the dataset with [0]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
cpu
# Extract data and move it to the selected device
data_cora = dataset_cora[0].to(device)
# Print the dataset's attributes
print(dataset_cora.num_classes) # number of label classes
print(dataset_cora.num_node_features) # dimension of the node features
print(len(dataset_cora)) # number of graphs in the dataset
# Print data
print(data_cora)
7
1433
1
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
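The shapes in the printed Data object can be read as follows: x=[2708, 1433] is one 1433-dimensional feature row per node, y and the three masks have one entry per node, and edge_index=[2, 10556] is an edge list with one column per directed edge. The sketch below reproduces that layout on a toy graph. It uses plain Python lists instead of tensors so it runs without PyG, and all values are made up for illustration. It also assumes each undirected edge is stored once per direction, which is my understanding of how PyG stores Cora; under that assumption 10556 columns would correspond to 5278 undirected edges.

```python
# Toy stand-in for the fields printed above; plain lists replace tensors.
# An undirected path graph with 4 nodes and 3 edges: 0-1, 1-2, 2-3.
undirected_edges = [(0, 1), (1, 2), (2, 3)]

# COO layout: row 0 holds source nodes, row 1 holds target nodes.
# Assumption: every undirected edge is stored twice, once per direction.
sources, targets = [], []
for u, v in undirected_edges:
    sources += [u, v]
    targets += [v, u]
edge_index = [sources, targets]

shape = (len(edge_index), len(edge_index[0]))
print(shape)  # (2, 6): 2 rows, 2 * 3 directed edges

# Boolean masks have one entry per node and pick out the nodes of a split.
labels = [0, 1, 1, 0]
train_mask = [True, False, False, False]
test_mask = [False, False, True, True]

train_labels = [y for y, keep in zip(labels, train_mask) if keep]
test_labels = [y for y, keep in zip(labels, test_mask) if keep]
print(train_labels)  # [0]
print(test_labels)   # [1, 0]
```

With real tensors the same selection is usually written as boolean indexing, for example `data_cora.y[data_cora.train_mask]`.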
1.3 Inspecting the attributes of data
# Extract the individual attributes
x = data_cora.x # node feature matrix [N, input_dim]
edge_index