GNN常用数据集之Cora数据集

好运来2333

已于 2022-04-08 18:53:42 修改

阅读量3.3w

点赞数 54

分类专栏： DeepLearning 文章标签：深度学习

于 2020-01-02 09:33:37 首次发布

本文链接：https://blog.csdn.net/qq_33254870/article/details/103553661

版权

DeepLearning 专栏收录该内容

32 篇文章

订阅专栏

本文详细介绍了Cora数据集，一个用于图神经网络(GNN)研究的常见基准。Cora由2708篇机器学习论文组成，分为7个类别，通过引用关系构建图结构。文章提供了数据集的下载链接，解释了.cor内容文件和.cites文件的格式，展示了如何使用Python读取和处理数据，构建邻接矩阵，提取特征向量和标签，为GNN项目提供数据准备指导。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

在学习图神经网络 GNN 之前，必然要了解一些GNN的常用数据集，这篇博客主要以Cora数据集为例介绍GNN的数据集格式与读取方式，并以一个项目实例进行说明。
GNN常用数据集：https://linqs.soe.ucsc.edu/data
在这里插入图片描述

1. Cora数据集介绍

Cora数据集下载地址：https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
以下内容均是对Cora数据集文件夹中README文件的翻译与解读，为避免误导读者，现作此说明，若有误请告知更正，感谢各位大牛不吝指教！

1.1 数据集概括

Cora数据集由机器学习论文组成。这些论文分为以下七个类别之一：

基于案例
遗传算法
神经网络
概率方法
强化学习
规则学习
理论

这些论文的选择方式是，在最终语料库中，每篇论文引用或被至少一篇其他论文引用。整个语料库中有 2708篇论文。

在词干堵塞和去除词尾后，只剩下 1433个唯一的单词。文档频率小于10的所有单词都被删除。

1.2 数据集文件说明

该数据集由 cora.cites 与 cora.content 两个文件组成。

1.2.1 cora.content

The .content file contains descriptions of the papers in the following format:
<paper_id> <word_attributes>+ <class_label>

The first entry in each line contains the unique string ID of the paper followed by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper. Finally, the last entry in the line contains the class label of the paper.

.content文件包含以下格式的论文描述：<paper_id> <word_attributes>+ <class_label>

每行（其实就是图的一个节点）的第一个字段是论文的唯一字符串标识，后跟 1433 个字段（取值为二进制值），表示1433个词汇中的每个单词在文章中是存在(由1表示)还是不存在(由0表示)。最后，该行的最后一个字段表示论文的类别标签（7个）。因此该数据的特征应该有 1433 个维度，另外加上第一个字段 idx，最后一个字段 label，一共有 1433 + 2 个维度。
在这里插入图片描述

1.2.2 cora.cites

The .cites file contains the citation graph of the corpus. Each line describes a link in the following format：
< ID of cited paper > < ID of citing paper>

Each line contains two paper IDs. The first entry is the ID of the paper being cited and the second ID stands for the paper which contains the citation. The direction of the link is from right to left. If a line is represented by “paper1 paper2” then the link is “paper2->paper1”.

.cites文件包含语料库的引用关系‘图’。
每行（其实就是图的一条边）用以下格式描述一个引用关系：<被引论文编号> <引论文编号>

每行包含两个paper id。第一个字段是被引用论文的标识，第二个字段代表引用的论文。引用关系的方向是从右向左。如果一行由“论文1 论文2”表示，则“论文2 引用论文1”，即链接是“论文2 - >论文1”。可以通过论文之间的链接（引用）关系建立邻接矩阵adj。
在这里插入图片描述

2. Python对数据进行处理

2.1 读取文件

import pandas as pd
import numpy as np

raw_data = pd.read_csv('cora.content', sep='\t', header=None)
print("content shape: ", raw_data.shape)

raw_data_cites = pd.read_csv('cora.cites', sep='\t', header=None)
print("cites shape: ", raw_data_cites.shape)

读取两个1.2节中的两个文件，并查看数据维度。
在这里插入图片描述
可以发现，content文件数据一共有2708条记录（即图有2708个节点），每条记录有 1435 个维度，其中 1433 个维度是记录的 feature（即1433个词汇）。cites文件数据一共有5429条记录（即图有5429条边），每条记录包括引用与被引用 paper id。

2.2 提取特征向量与标签

import pandas as pd

raw_data = pd.read_csv('cora.content', sep='\t', header=None)
print("content shape: ", raw_data.shape)

features = raw_data.iloc[:,1:-1]
print("features shape: ", features.shape)

# one-hot encoding
labels = pd.get_dummies(raw_data[1434])
print("\n----head(3) one-hot label----")
print(labels.head(3))

在这里插入图片描述

2.3 构建邻接矩阵

import pandas as pd
import numpy as np

raw_data = pd.read_csv('cora.content', sep='\t', header=None)
num_nodes = raw_data.shape[0]

# 将节点重新编号为[0, 2707]
new_id = list(raw_data.index)
id = list(raw_data[0])
c = zip(id, new_id)
map = dict(c)

raw_data_cites = pd.read_csv('cora.cites', sep='\t', header=None)

# 根据节点个数定义矩阵维度
matrix = np.zeros((num_nodes,num_nodes))

# 根据边构建矩阵
for i ,j in zip(raw_data_cites[0],raw_data_cites[1]):
    x = map[i] ; y = map[j]
    matrix[x][y] = matrix[y][x] = 1   # 无向图：有引用关系的样本点之间取1

# 查看邻接矩阵的元素
print(matrix)