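Task 2: Environment Setup and the PyG Library
This section covers the PyTorch Geometric (PyG) library, mainly its Data and Dataset classes.
Environment setup
- Install the GPU driver correctly
- Install pytorch and cudatoolkit
- Install PyG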
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu111.html
pip install torch-geometric
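The wheel index URL in these commands encodes both the PyTorch version (1.8.0) and the CUDA version (cu111), so it has to match what is installed locally. Below is a minimal sketch of how the commands could be generated for another combination. The helper name `pyg_install_commands` is hypothetical, and the URL pattern is simply copied from the commands above; it only formats strings and installs nothing.

```python
# Hypothetical helper: rebuilds the pip commands shown above for a given
# PyTorch / CUDA combination. It only formats strings; nothing is installed.
PYG_EXTENSIONS = [
    "torch-scatter",
    "torch-sparse",
    "torch-cluster",
    "torch-spline-conv",
]


def pyg_install_commands(torch_version, cuda_tag):
    """Return pip commands for the PyG extension packages plus torch-geometric.

    torch_version: e.g. "1.8.0"; cuda_tag: e.g. "cu111".
    """
    # URL pattern taken from the commands in these notes (an assumption
    # for any version other than torch 1.8.0 + cu111).
    index_url = (
        "https://pytorch-geometric.com/whl/"
        f"torch-{torch_version}+{cuda_tag}.html"
    )
    commands = [f"pip install {pkg} -f {index_url}" for pkg in PYG_EXTENSIONS]
    commands.append("pip install torch-geometric")
    return commands


if __name__ == "__main__":
    for cmd in pyg_install_commands("1.8.0", "cu111"):
        print(cmd)
```

The two values can be read from an existing installation with `torch.__version__` and `torch.version.cuda`.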
The Data class
The constructor of the Data class:
class Data(object):
    def __init__(self, x=None, edge_index=None,
                 edge_attr=None, y=None, **kwargs):
        """
        x (Tensor, optional): node feature matrix of shape `[num_nodes, num_node_features]`
        edge_index (LongTensor, optional): edge index matrix of shape `[2, num_edges]`; row 0 holds the source (tail) node of each edge and row 1 the target (head) node, so each edge points from row 0 to row 1
        edge_attr (Tensor, optional): edge feature matrix of shape `[num_edges, num_edge_features]`
        y (Tensor, optional): node-level or graph-level labels of arbitrary shape (they can in fact also be edge labels)
        """
        self.x = x
        self.edge_index = edge_index
        self.edge_attr = edge_attr
        self.y = y
        for key, item in kwargs.items():
            if key == 'num_nodes':
                self.__num_nodes__ = item
            else:
                self[key] = item
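To make the `[2, num_edges]` layout of `edge_index` concrete, here is a small plain-Python sketch that turns a list of (source, target) pairs into the two parallel rows. It does not use PyG, and the helper name `to_edge_index` is made up for this illustration.

```python
# Plain-Python sketch of the [2, num_edges] layout; no PyG required.
# The helper name `to_edge_index` is made up for this illustration.


def to_edge_index(edges, undirected=False):
    """Convert (source, target) pairs into two parallel rows.

    Row 0 holds the source node of each edge, row 1 the target node.
    With undirected=True every edge is also stored in the reverse direction.
    """
    pairs = list(edges)
    if undirected:
        pairs = pairs + [(dst, src) for src, dst in pairs]
    sources = [src for src, _ in pairs]
    targets = [dst for _, dst in pairs]
    return [sources, targets]


if __name__ == "__main__":
    # A 3-node path graph 0 - 1 - 2, stored as an undirected graph.
    edge_index = to_edge_index([(0, 1), (1, 2)], undirected=True)
    print(edge_index)  # [[0, 1, 1, 2], [1, 2, 0, 1]]
```

The nested list could then be converted with `torch.tensor(edge_index, dtype=torch.long)` and passed to `Data(edge_index=...)`. An undirected graph stores each edge once per direction, so it has twice as many columns as it has undirected edges.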
The Dataset class
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/dataset/Cora', name='Cora')
# Cora()
len(dataset)  # 1
dataset.num_classes  # 7
dataset.num_node_features  # 1433
Inspecting a sample from the dataset:
data = dataset[0]
# Data(edge_index=[2, 10556], test_mask=[2708],
# train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
data.is_undirected()
# True
data.train_mask.sum().item()
# 140
data.val_mask.sum().item()
# 500
data.test_mask.sum().item()
# 1000
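One question here: there are 2708 nodes in total, but the training, validation, and test sets add up to only 140+500+1000=1640 nodes. What are the other nodes used for?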
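As far as I can tell, the remaining 2708 - 1640 = 1068 nodes belong to none of the three masks. Their labels are not used for the loss or for evaluation, but the nodes and their edges stay in the graph, so they should still take part in message passing. The sketch below reproduces the arithmetic with plain Python lists standing in for the boolean mask tensors. The positions of the True entries are invented; only the counts follow the outputs above.

```python
# Plain-Python stand-ins for data.train_mask / val_mask / test_mask.
# The positions of the True entries are invented; only the counts
# (140 / 500 / 1000 out of 2708) follow the outputs shown above.
NUM_NODES = 2708


def make_mask(start, count, size=NUM_NODES):
    """Boolean list with `count` True values starting at index `start`."""
    return [start <= i < start + count for i in range(size)]


train_mask = make_mask(0, 140)
val_mask = make_mask(140, 500)
test_mask = make_mask(640, 1000)

# A node is unused for supervision if it is in none of the three masks.
unused_mask = [
    not (tr or va or te)
    for tr, va, te in zip(train_mask, val_mask, test_mask)
]
num_unused = sum(unused_mask)

if __name__ == "__main__":
    print(sum(train_mask), sum(val_mask), sum(test_mask))  # 140 500 1000
    print(num_unused)  # 1068
```

With the real boolean tensors, the same mask would be `~(data.train_mask | data.val_mask | data.test_mask)`.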
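Assignment
Implement a class that inherits from Data and is dedicated to representing an "institution-author-paper" network. The network contains three types of nodes ("institution", "author", and "paper") and two types of edges ("author-institution" and "author-paper"). Requirements for the class: 1) store the attributes of different node types in different attributes; 2) store the different edge types in different attributes (the edges have no attributes); 3) implement a separate method for getting the number of nodes of each type.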
from torch_geometric.data import Data


class MyData(Data):
    def __init__(self, x=None, edge_index=None, edge_attr=None, y=None, **kwargs):
        r"""
        Args:
            x: list [x_institution, x_author, x_paper] of node feature matrices, each of shape [num_nodes, num_node_features]
            edge_index: list [e_ins_author, e_author_paper] of edge index matrices, each of shape [2, num_edges]
            edge_attr: optional list of edge feature matrices, each of shape [num_edges, num_edge_features]
            y (Tensor, optional): node-level or graph-level labels
        """
        super(MyData, self).__init__()
        # One attribute per node type.
        self.institution, self.author, self.paper = x
        # One attribute per edge type.
        self.ins_author, self.author_paper = edge_index
        # Unpack the edge_attr argument, not self.edge_attr (still None here).
        # The edges in this assignment have no attributes, so this is optional.
        if edge_attr is not None:
            self.ins_author_att, self.author_paper_att = edge_attr
        self.y = y
        for key, item in kwargs.items():
            if key == 'num_nodes':
                self.__num_nodes__ = item
            else:
                self[key] = item

    def get_node_num(self, node_type):
        # The get_*_num members are properties, so they are read without calling.
        if node_type == 'institution':
            return self.get_institution_num
        elif node_type == 'author':
            return self.get_author_num
        elif node_type == 'paper':
            return self.get_paper_num
        else:
            # Any other value returns the counts of all three node types.
            institution = self.get_institution_num
            author = self.get_author_num
            paper = self.get_paper_num
            return institution, author, paper

    @property
    def get_institution_num(self):
        return len(self.institution)

    @property
    def get_author_num(self):
        return len(self.author)

    @property
    def get_paper_num(self):
        return len(self.paper)
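After writing this, I had a better understanding of how to abstract an original dataset into the nodes of a GNN, and how to abstract the relationships between those nodes into edges.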