(DataWhale) Graph Neural Networks Task01: Representing and Using Graph Data with the PyG Package

Latest recommended article published on 2024-08-22 14:16:01

misite_J

Category: DataWhale. Tags: python, GNN

Article link: https://blog.csdn.net/misite_J/article/details/117934627


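Contents

A first look at the PyG `Data` class
- Creating a `graph_data` object
- Exploring a `graph_data` instance
A first look at the PyG `Dataset` class
- Downloading a built-in dataset (Cora)
Homework 1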

A first look at the PyG `Data` class

Creating a `graph_data` object

There are two ways to create one:

Call the torch_geometric.data.Data constructor directly: graph_data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y, num_nodes=num_nodes, other_attr=other_attr).
Convert a dict into a Data object: graph_data = Data.from_dict(graph_dict).

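The edge_index attribute of Data is the edge index matrix in COO format, with shape [2, num_edges]. Put simply, it stores only the nonzero entries of the graph's adjacency matrix, i.e. the edges that actually exist between nodes. Row 0 of edge_index holds the source (head) node of each edge and row 1 holds the target (tail) node, so each edge points from the node in row 0 to the node in row 1; see the example figure below:

Adjacency matrix of a graph: a square matrix A is used to represent a graph with n vertices.

For a directed graph, the row index is the head and the column index is the tail, with the head pointing to the tail, so the number of nonzero entries in row i of the adjacency matrix is exactly the out-degree of vertex i.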
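The relationship between the adjacency matrix, the COO edge index and vertex degrees can be checked in plain Python. This is a small sketch on a made-up directed graph; it follows PyG's convention that row 0 of the edge index holds source nodes and row 1 holds target nodes, and the helper names are hypothetical.

```python
# Directed toy graph with edges 0 -> 1, 0 -> 2 and 2 -> 1.
# adjacency[i][j] == 1 means there is an edge from vertex i to vertex j.
adjacency = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]


def adjacency_to_edge_index(matrix):
    """Return [sources, targets] for every nonzero entry, scanned row by row."""
    sources, targets = [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


def degrees(edge_index, num_nodes):
    """Count outgoing and incoming edges per vertex."""
    out_deg = [0] * num_nodes
    in_deg = [0] * num_nodes
    for source, target in zip(*edge_index):
        out_deg[source] += 1
        in_deg[target] += 1
    return out_deg, in_deg


edge_index = adjacency_to_edge_index(adjacency)
out_degree, in_degree = degrees(edge_index, len(adjacency))

print(edge_index)   # [[0, 0, 2], [1, 2, 1]]
print(out_degree)   # [2, 0, 1]  (equals the nonzero count of each row)
print(in_degree)    # [0, 2, 1]  (equals the nonzero count of each column)
```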

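Storage formats for sparse matrices: COO, CSR and CSC.

COO: three arrays, row, column and data, store the row index, the column index and the value of each nonzero element.
CSR: the row array of the COO format is compressed while the other two arrays stay unchanged. The three arrays are row_ptr, columns and data, where row_ptr has length M+1 (M is the number of rows of the sparse matrix). Row i contains row_ptr[i+1]-row_ptr[i] nonzero elements, and they are data[row_ptr[i]:row_ptr[i+1]].
CSC: the column array of the COO format is compressed while the other two arrays stay unchanged.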
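The three formats can be compared on a small matrix. The sketch below is plain Python written for illustration (it is not how any particular library implements the conversions), and the function names are hypothetical.

```python
dense = [
    [0, 5, 0, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 9],
]


def dense_to_coo(matrix):
    """Collect (row, column, value) of every nonzero entry, row by row."""
    row, col, data = [], [], []
    for i, values in enumerate(matrix):
        for j, value in enumerate(values):
            if value != 0:
                row.append(i)
                col.append(j)
                data.append(value)
    return row, col, data


def coo_to_csr(row, col, data, num_rows):
    """Compress the row array; assumes the COO entries are sorted by row."""
    row_ptr = [0] * (num_rows + 1)
    for r in row:
        row_ptr[r + 1] += 1          # count nonzeros per row
    for i in range(num_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives the row offsets
    return row_ptr, list(col), list(data)


def coo_to_csc(row, col, data, num_cols):
    """Compress the column array after reordering entries by column."""
    order = sorted(range(len(data)), key=lambda k: (col[k], row[k]))
    col_ptr = [0] * (num_cols + 1)
    for c in col:
        col_ptr[c + 1] += 1
    for j in range(num_cols):
        col_ptr[j + 1] += col_ptr[j]
    return col_ptr, [row[k] for k in order], [data[k] for k in order]


row, col, data = dense_to_coo(dense)
row_ptr, csr_cols, csr_data = coo_to_csr(row, col, data, num_rows=len(dense))
col_ptr, csc_rows, csc_data = coo_to_csc(row, col, data, num_cols=len(dense[0]))

print(row, col, data)                        # [0, 2, 2] [1, 0, 3] [5, 7, 9]
print(row_ptr)                               # [0, 1, 1, 3]
print(csr_data[row_ptr[2]:row_ptr[3]])       # [7, 9]  (nonzeros of row 2)
print(col_ptr, csc_rows, csc_data)           # [0, 1, 2, 2, 3] [2, 0, 2] [7, 5, 9]
```

Row 1 is empty, which shows up as two equal consecutive entries in row_ptr.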

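Exploring a `graph_data` instance

Zachary's karate club network is a social network built from observations of a karate club at a US university. It contains 34 nodes and 78 edges: each node is a club member, and each edge is a friendship between two members.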

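from torch_geometric.datasets import KarateClub

dataset = KarateClub()
data = dataset[0]  # Get the first graph object.
print(data)
# Data(edge_index=[2, 156], train_mask=[34], x=[34, 34], y=[34])
# x=[34, 34]: 34 nodes, each with a 34-dimensional feature vector
# edge_index=[2, 156]: 156 directed edges (each of the 78 undirected edges is stored in both directions);
# row 0 holds the source nodes and row 1 holds the target nodes.
print('==============================================================')

# Show some properties of the graph
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of node features (num_node_features): {data.num_node_features}')
print(f'Number of node features (num_features, an alias): {data.num_features}')
print(f'Number of edge features: {data.num_edge_features}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Edges are sorted and free of duplicates: {data.is_coalesced()}')
print(f'Number of training nodes: {data.train_mask.sum()}')
# In PyG 2.x the next two checks are named has_isolated_nodes() and has_self_loops().
print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')
print(f'Contains self-loops: {data.contains_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')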
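To make these properties less of a black box, here is a plain-Python sketch of what they measure on a raw edge index. It is a simplified illustration rather than PyG's implementation, and graph_summary is a hypothetical helper. It also shows why KarateClub reports 156 edges for 78 friendships: an undirected graph stores every edge in both directions.

```python
def graph_summary(edge_index, num_nodes):
    """Compute a few basic graph properties from a [sources, targets] edge index."""
    sources, targets = edge_index
    edges = set(zip(sources, targets))
    # Nodes touched by at least one edge that is not a self-loop.
    connected = {node for s, t in edges if s != t for node in (s, t)}
    return {
        "num_edges": len(sources),
        "avg_degree": len(sources) / num_nodes,
        "has_self_loops": any(s == t for s, t in edges),
        "has_isolated_nodes": len(connected) < num_nodes,
        # Undirected means every edge (s, t) has its reverse (t, s).
        "is_undirected": all((t, s) in edges for s, t in edges),
    }


# Toy graph: undirected edges 0-1 and 1-2, plus node 3 with no edges.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
summary = graph_summary(edge_index, num_nodes=4)
for key, value in summary.items():
    print(f"{key}: {value}")
```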

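A first look at the PyG `Dataset` class

Downloading a built-in dataset (Cora)

from torch_geometric.datasets import Planetoid

# The program first downloads the raw files, then processes them into a `Dataset` object containing `Data` objects and saves the result to disk
dataset = Planetoid(root='/dataset', name='Cora')

The code above failed with an error because the built-in download URL "https://github.com/kimiyoung/planetoid/raw/master/data/" could not be reached.

Workaround:

Download the data manually (from 'https://github.com/kimiyoung/planetoid/tree/master/data' or 'https://gitee.com/jiajiewu/planetoid/tree/master/data') and place it in the expected location, '/dataset/Cora/raw'.
Temporarily disable the download() function in planetoid.py.
Run the original code again; it now succeeds.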
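Before re-running the loader it can help to confirm that every raw file is where Planetoid expects it. The following is a minimal sketch using only the standard library; the list of file suffixes is an assumption based on the layout of the upstream data directory, and missing_raw_files is a hypothetical helper. As far as I know, PyG skips the download step when all expected raw files already exist, so this check may also show whether patching download() is needed at all.

```python
import tempfile
from pathlib import Path

# Suffixes of the raw Planetoid files (assumed from the upstream data directory).
RAW_SUFFIXES = ("x", "tx", "allx", "y", "ty", "ally", "graph", "test.index")


def missing_raw_files(root, name="Cora"):
    """Return the raw file names not yet present under <root>/<name>/raw."""
    raw_dir = Path(root) / name / "raw"
    expected = [f"ind.{name.lower()}.{suffix}" for suffix in RAW_SUFFIXES]
    return [fname for fname in expected if not (raw_dir / fname).is_file()]


# Demonstration in a throwaway directory, using empty stand-in files.
with tempfile.TemporaryDirectory() as root:
    before = missing_raw_files(root)
    raw_dir = Path(root) / "Cora" / "raw"
    raw_dir.mkdir(parents=True)
    for fname in before:
        (raw_dir / fname).touch()
    after = missing_raw_files(root)

print(len(before))  # 8
print(after)        # []
```

With the real data, calling missing_raw_files('/dataset') should return an empty list once the manual download is complete.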

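Homework 1

Implement a class that inherits from the Data class and is dedicated to representing an "institution-author-paper" network. The network contains three types of nodes ("institution", "author" and "paper") and two types of edges ("author-institution" and "author-paper"). Requirements for the class: 1) store the attributes of different node types in different attributes; 2) store the different edge types in different attributes (the edges have no attributes); 3) implement a method for getting the number of nodes of each type.

Python class inheritance

Adapted from: https://www.cnblogs.com/bigberg/p/7182741.html
In Python 3, every class inherits from the object base class.

Parent class definition: class FooParent(object); child class: class FooChild(FooParent).

When defining the child class constructor, call the parent constructor first and only then do the child's own setup, so that the attributes defined by the parent are available (in Python 3, super().xxx can be used directly instead of super(Class, self).xxx). The call chain is: instantiate object c --> c calls the child's __init__() --> the child's __init__() calls super().__init__() --> the parent's __init__() runs.
When using super, pass the parameters that the child shares with the parent to __init__(xxx) in order; self is already bound by super() and is passed implicitly, so it is not written out.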
class FooParent(object):
    def __init__(self):
        self.parent = "I'm the parent."
        print('Parent')

    def bar(self, message):
        print("%s from Parent" % message)


class FooChild(FooParent):
    def __init__(self):
        # super(FooChild, self) returns a proxy that looks up methods starting from the class after
        # FooChild in the method resolution order (here FooParent), while self stays this FooChild instance
        super(FooChild, self).__init__()
        print('Child')

    def bar(self, message):
        super(FooChild, self).bar(message)
        print('Child bar function')
        print(self.parent)


if __name__ == '__main__':
    fooChild = FooChild()
    fooChild.bar('HelloWorld')
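The example above uses constructors without parameters. The sketch below shows the parameter-passing case described earlier, using the shorter Python 3 form super(). The Graph and TypedGraph classes are hypothetical and exist only for this illustration.

```python
class Graph:
    def __init__(self, num_nodes, name="graph"):
        self.num_nodes = num_nodes
        self.name = name

    def describe(self):
        return f"{self.name} with {self.num_nodes} nodes"


class TypedGraph(Graph):
    def __init__(self, num_nodes, node_types, name="typed graph"):
        # Forward the shared parameters to the parent; self is passed implicitly.
        super().__init__(num_nodes, name=name)
        # Child-specific attribute, set after the parent has been initialised.
        self.node_types = node_types

    def describe(self):
        # Reuse the parent's method and extend its result.
        base = super().describe()
        return f"{base} and {len(set(self.node_types))} node types"


g = TypedGraph(3, ["org", "author", "paper"])
print(g.describe())  # typed graph with 3 nodes and 3 node types
print([cls.__name__ for cls in TypedGraph.__mro__])  # ['TypedGraph', 'Graph', 'object']
```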

import logging

import torch
from torch_geometric.data import Data


class SchData(Data):
    def __init__(self, x=None, edge_index=None, node_types=None, edge_types=None, **kwargs):
        '''
        :param node_types: [1, num_nodes]; 0 = institution, 1 = author, 2 = paper
        :param edge_types: [1, num_edges] or None; when omitted, the edge types could be inferred
                           from the node types of each edge's two endpoints (not implemented yet)
        '''
        super().__init__(x, edge_index, **kwargs)
        self.node_types = node_types
        self.edge_types = edge_types

        # Append the node type as the last column: [num_nodes, num_node_features + 1]
        self.nodes = torch.cat((x, node_types.T), dim=1)
        self.orgs, self.auts, self.paps = self.statistics()

    def statistics(self):
        # Collect the node indices of each node type
        org, aut, pap = [], [], []
        for idx, node_type in enumerate(self.nodes[:, -1]):
            if node_type == 0:
                org.append(idx)
            elif node_type == 1:
                aut.append(idx)
            elif node_type == 2:
                pap.append(idx)
            else:
                logging.warning('There is a node of unknown type!')

        return org, aut, pap

    @property
    def node_org(self):
        # Institution nodes: print the count, return the features without the type column
        print('Number of organizations: {}'.format(len(self.orgs)))
        return self.nodes[self.orgs, :-1]

    @property
    def node_aut(self):
        # Author nodes
        print('Number of authors: {}'.format(len(self.auts)))
        return self.nodes[self.auts, :-1]

    @property
    def node_pap(self):
        # Paper nodes
        print('Number of papers: {}'.format(len(self.paps)))
        return self.nodes[self.paps, :-1]

    @property
    def edge_classify(self):
        edge_10 = []
        edge_12 = []
        if self.edge_types is not None:
            for idx, edge_type in enumerate(self.edge_types[0]):
                if edge_type == 0:  # 0: author & institution (1-0)
                    edge_10.append(tuple(self.edge_index[:, idx].tolist()))
                elif edge_type == 1:  # 1: author & paper (1-2)
                    edge_12.append(tuple(self.edge_index[:, idx].tolist()))
                else:
                    logging.warning('There is an edge of unknown type!')
            return 'author-institution:', edge_10, 'author-paper:', edge_12
        else:
            # TODO
            # Without edge_types, the edge types could be inferred from the endpoint node types
            pass


if __name__ == '__main__':
    x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1]])  # (7,2)
    edge_index = torch.tensor([[0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
                               [2, 3, 4, 0, 5, 6, 0, 5, 1, 6, 2, 3, 2, 4]])  # (2,14)
    node_types = torch.tensor([[0, 0, 1, 1, 1, 2, 2]])  # 0: institution, 1: author, 2: paper
    edge_types = torch.tensor([[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]])  # 0: author & institution, 1: author & paper
    # num_nodes = 7  # if omitted, the Data.num_nodes property infers it from x

    data = SchData(x=x, edge_index=edge_index, node_types=node_types, edge_types=edge_types)
    print(data)
    print('==============================================================')

    # Print some information about the graph
    print(f'Number of nodes: {data.num_nodes}')  # number of nodes
    print(f'Number of edges: {data.num_edges}')  # number of edges
    print(f'Number of node features: {data.num_node_features}')  # node feature dimension
    print(f'Number of edge features: {data.num_edge_features}')  # edge feature dimension
    print(f'Organization node features: {data.node_org}')  # institution node count and features
    print(f'Author node features: {data.node_aut}')  # author node count and features
    print(f'Paper node features: {data.node_pap}')  # paper node count and features
    print(f'Edge categories: {data.edge_classify}')  # edges grouped by type

# Output:
# SchData(auts=[3], edge_index=[2, 14], edge_types=[1, 14], node_types=[1, 7], nodes=[7, 3], orgs=[2], paps=[2], x=[7, 2])
# ==============================================================
# Number of nodes: 7
# Number of edges: 14
# Number of node features: 2
# Number of edge features: 0
# Number of organizations: 2
# Organization node features: tensor([[0, 0], [0, 1]])
# Number of authors: 3
# Author node features: tensor([[1, 0], [1, 1], [1, 2]])
# Number of papers: 2
# Paper node features: tensor([[2, 0], [2, 1]])
# Edge categories: ('author-institution:', [(0, 2), (0, 3), (1, 4), (2, 0), (3, 0), (4, 1)],
#                   'author-paper:', [(2, 5), (2, 6), (3, 5), (4, 6), (5, 2), (5, 3), (6, 2), (6, 4)])
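The TODO branch, where no edge types are supplied, could be approached by looking at the node types of each edge's two endpoints. The following plain-Python sketch is one possible way to do this and is not part of the solution above; infer_edge_types and the constants are hypothetical names. It reuses the edge index and node types of the example and reproduces the edge_types tensor that was passed in by hand.

```python
ORG, AUTHOR, PAPER = 0, 1, 2          # node type codes used in the example
AUTHOR_ORG, AUTHOR_PAPER = 0, 1       # edge type codes used in the example


def infer_edge_types(edge_index, node_types):
    """Derive each edge's type from the unordered pair of its endpoint node types."""
    inferred = []
    for source, target in zip(*edge_index):
        pair = {node_types[source], node_types[target]}
        if pair == {AUTHOR, ORG}:
            inferred.append(AUTHOR_ORG)
        elif pair == {AUTHOR, PAPER}:
            inferred.append(AUTHOR_PAPER)
        else:
            raise ValueError(f"unsupported edge between nodes {source} and {target}")
    return inferred


edge_index = [[0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
              [2, 3, 4, 0, 5, 6, 0, 5, 1, 6, 2, 3, 2, 4]]
node_types = [0, 0, 1, 1, 1, 2, 2]

edge_types = infer_edge_types(edge_index, node_types)
print(edge_types)  # [0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
```

Using an unordered pair means both directions of an undirected edge get the same type; raising on any other pair is a design choice that surfaces malformed input early instead of only logging it.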