Reference open-source learning material: Datawhale
1. InMemoryDataset
Introduction to the base class
In PyG, we define a dataset class whose data fits entirely in memory by subclassing `InMemoryDataset`.
class InMemoryDataset(root: Optional[str] = None, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None)
Official `InMemoryDataset` documentation: torch_geometric.data.InMemoryDataset
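As the constructor signature of `InMemoryDataset` above shows, every dataset needs a root folder (`root`), which indicates where the dataset should be stored. The root folder contains at least two subfolders:
- `raw_dir`, which stores the unprocessed files; dataset files downloaded from the network are placed here;
- `processed_dir`, where the processed dataset is saved.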
In addition, every dataset class that inherits from `InMemoryDataset` can be passed a `transform` function, a `pre_transform` function, and a `pre_filter` function. All three default to `None`.
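- `transform` takes a `Data` object as its argument, transforms it, and returns it. It is called on every data access, so it should be used for data augmentation.
- `pre_transform` takes a [`Data`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data) object as its argument, transforms it, and returns it. It is called before the sample [`Data`](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data) objects are saved to file, so it is best used for heavy precomputation that only needs to run once.
- `pre_filter` can manually filter out data objects before they are saved. One use case is filtering samples by class.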
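The difference between the three hooks is mainly about when they run. The sketch below is a toy stand-in written in plain Python, not PyG's implementation: `ToyDataset`, `scale`, `add_flag`, and `keep_class_zero` are hypothetical names, and samples are plain dictionaries rather than `Data` objects. It only illustrates that `pre_filter` and `pre_transform` run once before storage, while `transform` runs on every access.

```python
# Toy stand-in for the call order only; this is NOT PyG's implementation.
class ToyDataset:
    def __init__(self, raw_samples, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        # "Process" step: runs once, before the samples are stored.
        kept = [s for s in raw_samples if pre_filter is None or pre_filter(s)]
        self.stored = [pre_transform(s) if pre_transform else s for s in kept]

    def __getitem__(self, idx):
        sample = self.stored[idx]
        # Access step: runs on every access, which suits augmentation.
        return self.transform(sample) if self.transform else sample

    def __len__(self):
        return len(self.stored)


calls = {"transform": 0, "pre_transform": 0}

def scale(sample):  # hypothetical one-off precomputation
    calls["pre_transform"] += 1
    return {**sample, "x": [v * 10 for v in sample["x"]]}

def add_flag(sample):  # hypothetical per-access transform
    calls["transform"] += 1
    return {**sample, "augmented": True}

def keep_class_zero(sample):  # hypothetical class filter
    return sample["y"] == 0


raw = [{"x": [1, 2], "y": 0}, {"x": [3, 4], "y": 1}, {"x": [5, 6], "y": 0}]
ds = ToyDataset(raw, transform=add_flag, pre_transform=scale, pre_filter=keep_class_zero)
first = ds[0]
again = ds[0]  # second access triggers `transform` again, but not `pre_transform`
print(len(ds), first["x"], calls)
```

Accessing the same sample twice increments the `transform` counter twice, whereas `pre_transform` is counted once per stored sample.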
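To create an `InMemoryDataset`, we need to implement four basic methods:
- `raw_file_names()`: a property method that returns a list of file names. The files should be found in the `raw_dir` folder; otherwise `download()` is called to download them into `raw_dir`.
- `processed_file_names()`: a property method that returns a list of file names. The files should be found in the `processed_dir` folder; otherwise `process()` is called to preprocess the samples and save them to `processed_dir`.
- `download()`: downloads the raw data files into the `raw_dir` folder.
- `process()`: preprocesses the samples and saves them to the `processed_dir` folder.
The conversion of samples from raw files into `Data` objects is defined in the `process` function: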
import torch
from torch_geometric.data import InMemoryDataset, download_url


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root=root, transform=transform, pre_transform=pre_transform, pre_filter=pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        # `url` is a placeholder for the dataset's download address.
        download_url(url, self.raw_dir)
        ...

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
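In this function, we sometimes need to read and create a list of `Data` objects and save it to `processed_dir`. Because saving a huge Python list is rather slow, we use `collate()` to merge the list into one huge `Data` object before saving. The function also returns a slices dictionary that allows individual samples to be reconstructed from this object. Finally, in the constructor, we load the `Data` object and the slices dictionary into the attributes `self.data` and `self.slices`.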
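The idea behind collating and slicing can be illustrated on plain tensors. This is a simplified sketch, not PyG's `collate()`: `toy_collate` and `toy_get` are hypothetical helpers that handle a single tensor per sample, whereas the real method deals with every attribute of a `Data` object.

```python
import torch

# Simplified illustration of the idea behind collate(); PyG's real
# implementation handles many attribute types and is more involved.
def toy_collate(tensors):
    """Concatenate per-sample tensors and record where each one starts and ends."""
    big = torch.cat(tensors, dim=0)
    sizes = torch.tensor([0] + [t.size(0) for t in tensors])
    slices = torch.cumsum(sizes, dim=0)  # boundaries: [0, n1, n1 + n2, ...]
    return big, slices

def toy_get(big, slices, idx):
    """Rebuild sample `idx` from the big tensor using the slice boundaries."""
    start, end = int(slices[idx]), int(slices[idx + 1])
    return big[start:end]

samples = [torch.ones(2, 3), torch.zeros(4, 3), torch.full((1, 3), 7.0)]
big, slices = toy_collate(samples)
restored = toy_get(big, slices, 1)
print(big.shape, slices.tolist(), restored.shape)
```

Only one large tensor and one small boundary tensor need to be saved, and any sample can still be recovered by slicing.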
2. Node Prediction and Edge Prediction in Practice
2.1 Node prediction in practice
Neural network model
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU
from torch_geometric.nn import GATConv, Sequential


class GAT(torch.nn.Module):
    def __init__(self, num_features, hidden_channels_list, num_classes):
        super(GAT, self).__init__()
        torch.manual_seed(12345)
        # Layer widths: input features followed by each hidden size.
        hns = [num_features] + hidden_channels_list
        conv_list = []
        for idx in range(len(hidden_channels_list)):
            conv_list.append((GATConv(hns[idx], hns[idx+1]), 'x, edge_index -> x'))
            conv_list.append(ReLU(inplace=True),)
        self.convseq = Sequential('x, edge_index', conv_list)
        self.linear = Linear(hidden_channels_list[-1], num_classes)

    def forward(self, x, edge_index):
        x = self.convseq(x, edge_index)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.linear(x)
        return x
dataset.num_features: 500
GAT(
  (convseq): Sequential(
    (0): GATConv(500, 200, heads=1)
    (1): ReLU(inplace=True)
    (2): GATConv(200, 100, heads=1)
    (3): ReLU(inplace=True)
  )
  (linear): Linear(in_features=100, out_features=3, bias=True)
)
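The layer sizes in the printed model follow from how `hns` pairs consecutive widths. A minimal sketch in plain Python, assuming `hidden_channels_list=[200, 100]` (a value inferred from the printout above, not stated in the text):

```python
num_features = 500
hidden_channels_list = [200, 100]  # assumed: inferred from the printed model

# Same construction as in GAT.__init__: prepend the input width.
hns = [num_features] + hidden_channels_list

# Each convolution maps hns[idx] -> hns[idx + 1].
layer_dims = [(hns[idx], hns[idx + 1]) for idx in range(len(hidden_channels_list))]
print(layer_dims)  # [(500, 200), (200, 100)]
```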
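2.2 Edge prediction in practice
An edge prediction task is, for example, predicting whether an edge exists between two nodes.
Given a graph dataset, we have the node feature matrix `x` and `edge_index`, which records which nodes are connected by an edge. `edge_index` stores the positive samples. To build an edge prediction task, we need to generate negative samples, that is, sample node pairs that have no edge between them as negative edges. The positive and negative samples should be balanced.
In addition, the samples must be split into training, validation, and test sets.
PyG provides a ready-made method for this, `train_test_split_edges(data, val_ratio=0.05, test_ratio=0.1)`. Its first argument is a `torch_geometric.data.Data` object, the second is the fraction used for validation, and the third is the fraction used for testing. The function automatically samples negative edges and splits the positive and negative samples into training, validation, and test sets. It replaces the `edge_index` attribute with the attributes `train_pos_edge_index`, `train_neg_adj_mask`, `val_pos_edge_index`, `val_neg_edge_index`, `test_pos_edge_index`, and `test_neg_edge_index`.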
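To make the notion of balanced negative samples concrete, here is a toy rejection-sampling sketch in plain PyTorch. It is not PyG's sampler: `toy_negative_sampling` is a hypothetical helper that treats `(u, v)` and `(v, u)` as separate pairs and assumes `edge_index` has no self-loops.

```python
import torch

def toy_negative_sampling(edge_index, num_nodes, num_neg_samples, seed=0):
    """Rejection-sample directed node pairs that are absent from edge_index."""
    gen = torch.Generator().manual_seed(seed)
    existing = set(map(tuple, edge_index.t().tolist()))
    # Upper bound on available negatives: all ordered pairs without self-loops,
    # minus the existing edges (assumes edge_index itself has no self-loops).
    max_possible = num_nodes * (num_nodes - 1) - len(existing)
    target = min(num_neg_samples, max_possible)
    negatives = set()
    while len(negatives) < target:
        u, v = torch.randint(num_nodes, (2,), generator=gen).tolist()
        if u != v and (u, v) not in existing:
            negatives.add((u, v))
    return torch.tensor(sorted(negatives), dtype=torch.long).reshape(-1, 2).t()

# Path graph 0-1-2-3 on 5 nodes, with each undirected edge stored in both directions.
pos_edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                               [1, 0, 2, 1, 3, 2]])
# Balanced sampling: request as many negatives as there are positives.
neg_edge_index = toy_negative_sampling(pos_edge_index, num_nodes=5,
                                       num_neg_samples=pos_edge_index.size(1))
print(neg_edge_index.shape)
```

Requesting `pos_edge_index.size(1)` negatives keeps the two classes the same size, which is the balance the text asks for.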
Neural network model
import torch
from torch_geometric.nn import GCNConv


class Net(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Net, self).__init__()
        self.conv1 = GCNConv(in_channels, 128)
        self.conv2 = GCNConv(128, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        return self.conv2(x, edge_index)

    def decode(self, z, pos_edge_index, neg_edge_index):
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()
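A neural network for edge prediction consists of two main parts. The first is the encoder (`encode`), which works the same way as the node representation learning introduced earlier. The second is the decoder (`decode`), which takes the representations of the two endpoint nodes of an edge and produces a score for the edge being real (the odds; the raw dot product is usually treated as a logit, which a sigmoid turns into a probability). `decode_all(self, z)` is used at inference time, where we predict the odds of an edge existing for every pair of input nodes.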
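The decoder logic can be traced by hand on a tiny example. The sketch below reuses the same tensor operations as `decode` and `decode_all` on a hand-written embedding matrix `z`; the values of `z` and the two edges are made up for illustration.

```python
import torch

# Hypothetical 2-dimensional embeddings for three nodes.
z = torch.tensor([[1.0, 0.0],
                  [0.5, 0.5],
                  [-1.0, 0.0]])
pos_edge_index = torch.tensor([[0], [1]])  # edge 0 -> 1
neg_edge_index = torch.tensor([[0], [2]])  # non-edge 0 -> 2

# Same computation as Net.decode: dot product of the endpoint embeddings.
edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)
logits = (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)
probs = logits.sigmoid()  # map raw scores to (0, 1)

# Same computation as Net.decode_all: score every node pair at once
# and keep the pairs with a positive score.
all_scores = z @ z.t()
predicted = (all_scores > 0).nonzero(as_tuple=False).t()

print(logits.tolist())     # [0.5, -1.0]
print(predicted.tolist())  # [[0, 0, 1, 1, 2], [0, 1, 0, 1, 2]]
```

Note that `decode_all` as written also returns self-pairs such as `(0, 0)`, because a node's dot product with itself is positive for any non-zero embedding.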
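Exercises
- Practice question 1: for the node prediction task, try replacing `GCNConv` with other network layers from PyG, and try different numbers of layers and different `out_channels`.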
"""
Node classification using GATConv with different numbers of layers
"""
import os.path as osp

import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU
from torch_geometric.data import InMemoryDataset, download_url
from torch_geometric.io import read_planetoid_data
from torch_geometric.nn import GATConv, Sequential
from torch_geometric.transforms import NormalizeFeatures
class PlanetoidPubMed(InMemoryDataset):
    # Base URL of the raw Planetoid files; download() reads it via self.url.
    url = 'https://github.com/kimiyoung/planetoid/raw/master/data'

    def __init__(self, root, split="public", num_train_per_class=20,
                 num_val=500, num_test=1000, transform=None,
                 pre_transform=None):
        super(PlanetoidPubMed, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

        self.split = split
        assert self.split in ['public', 'full', 'random']

        if split == 'full':
            # Train on every node that is not in the validation or test set.
            data = self.get(0)
            data.train_mask.fill_(True)
            data.train_mask[data.val_mask | data.test_mask] = False
            self.data, self.slices = self.collate([data])

        elif split == 'random':
            # Draw num_train_per_class random training nodes per class.
            data = self.get(0)
            data.train_mask.fill_(False)
            for c in range(self.num_classes):
                idx = (data.y == c).nonzero(as_tuple=False).view(-1)
                idx = idx[torch.randperm(idx.size(0))[:num_train_per_class]]
                data.train_mask[idx] = True

            # Split the remaining nodes into validation and test sets.
            remaining = (~data.train_mask).nonzero(as_tuple=False).view(-1)
            remaining = remaining[torch.randperm(remaining.size(0))]

            data.val_mask.fill_(False)
            data.val_mask[remaining[:num_val]] = True

            data.test_mask.fill_(False)
            data.test_mask[remaining[num_val:num_val + num_test]] = True

            self.data, self.slices = self.collate([data])

    @property
    def raw_dir(self):
        return osp.join(self.root, 'raw')

    @property
    def processed_dir(self):
        return osp.join(self.root, 'processed')

    @property
    def raw_file_names(self):
        names = ['x', 'tx', 'allx', 'y', 'ty', 'ally', 'graph', 'test.index']
        return ['ind.pubmed.{}'.format(name) for name in names]

    @property
    def processed_file_names(self):
        return 'data.pt'

    def download(self):
        for name in self.raw_file_names:
            download_url('{}/{}'.format(self.url, name), self.raw_dir)

    def process(self):
        data = read_planetoid_data(self.raw_dir, 'pubmed')
        data = data if self.pre_transform is None else self.pre_transform(data)
        torch.save(self.collate([data]), self.processed_paths[0])

    def __repr__(self):
        # The class defines no `name` attribute, so use the class name.
        return '{}()'.format(self.__class__.__name__)
def train(data, model, optimizer, criterion):
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    # Compute the loss solely based on the training nodes.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss


@torch.no_grad()  # No gradients are needed for evaluation.
def test(data, model):
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Use the class with highest probability.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
    return test_acc
class GAT(torch.nn.Module):
    def __init__(self, num_features, hidden_channels_list, num_classes):
        super(GAT, self).__init__()
        torch.manual_seed(12345)
        hns = [num_features] + hidden_channels_list
        conv_list = []
        for idx in range(len(hidden_channels_list)):
            conv_list.append((GATConv(hns[idx], hns[idx+1]), 'x, edge_index -> x'))
            conv_list.append(ReLU(inplace=True),)
        self.convseq = Sequential('x, edge_index', conv_list)
        self.linear = Linear(hidden_channels_list[-1], num_classes)

    def forward(self, x, edge_index):
        x = self.convseq(x, edge_index)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.linear(x)
        return x
def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    dataset = PlanetoidPubMed('data/PlanetoidPubMed/', transform=NormalizeFeatures())
    # print('data.num_features:', dataset.num_features)
    data = dataset[0].to(device)

    model = GAT(num_features=dataset.num_features, hidden_channels_list=[400, 200, 100], num_classes=dataset.num_classes).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(1, 201):
        loss = train(data, model, optimizer, criterion)
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

    test_acc = test(data, model)
    print(f'Test Accuracy: {test_acc:.4f}')


if __name__ == "__main__":
    main()
# Test Accuracy: 0.7840
- Practice question 2: for the edge prediction task, try building the graph neural network with the `torch_geometric.nn.Sequential` container.
Original code
import torch
from torch_geometric.nn import GCNConv


class Net(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Net, self).__init__()
        self.conv1 = GCNConv(in_channels, 128)
        self.conv2 = GCNConv(128, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        return self.conv2(x, edge_index)

    def decode(self, z, pos_edge_index, neg_edge_index):
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()
Using the `torch_geometric.nn.Sequential` container (has problems, still to be fixed)
import torch
from torch.nn import ReLU
from torch_geometric.nn import GCNConv, Sequential


class Net(torch.nn.Module):
    # NOTE: `out_channels` is accepted but never used below.
    def __init__(self, in_channels, hidden_channels_list, out_channels):
        super(Net, self).__init__()
        torch.manual_seed(12345)
        hns = [in_channels] + hidden_channels_list
        conv_list = []
        for idx in range(len(hidden_channels_list)):
            conv_list.append((GCNConv(hns[idx], hns[idx+1]), 'x, edge_index -> x'))
            conv_list.append(ReLU(inplace=True), )  # inplace: overwrite the input with the computed values
        self.convseq = Sequential('x, edge_index', conv_list)

    def encode(self, x, edge_index):
        return self.convseq(x, edge_index)

    def decode(self, z, pos_edge_index, neg_edge_index):
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()
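- Discussion question 3: as the code below shows, we pass `data.train_pos_edge_index` as the actual argument when sampling negative edges for the training set. Negative samples drawn this way may include positive samples from the validation and test sets, meaning that real positive samples may be labeled as negative, which creates a conflict. Yet we still do it this way. Why?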
from torch_geometric.utils import negative_sampling

neg_edge_index = negative_sampling(
    edge_index=data.train_pos_edge_index,
    num_nodes=data.num_nodes,
    num_neg_samples=data.train_pos_edge_index.size(1))