The Cora dataset

Reposted from: http://blog.sina.com.cn/s/blog_4c98b96001000boc.html -- 苯苯的小田园

This took a lot of searching, so I am writing it down here. Thanks to the paper
"Object Identification with Attribute-Mediated Dependences" for pointing to the source of the Cora dataset:
http://www.cs.umass.edu/~mccallum/data/ (if the copied link does not open, type it into the address bar by hand).
The paper "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification" provided the hint that Cora has 6 top-level classes and 36 subclasses, which finally resolved the class-correlation problem I was stuck on.

(a) cora-refs.tar.gz
Cora Citation Matching [reference matching, object correspondence]: text of citations hand-clustered into groups referring to the same paper.
(b) cora-ie.tar.gz
Cora Information Extraction [information extraction]: research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
(c) cora-classify.tar.gz
Cora Research Paper Classification [relational document classification]: research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set because the citations provide relations among papers.
(d) cora-hmm.tar.gz
Cora HMM: the C implementation of the HMMs used for information extraction in Cora, written by Kristie Seymore.


Cora readme

Note that in Cora there are two types of papers: those we found on the
Web, and those that are referenced in bibliography sections. It is
possible that a paper we found on the Web is also referenced by other
papers.


FILE SUMMARY:

* The file 'papers' contains limited information on the papers we found
on the Web.

* The file 'citations' contains the citation graph.

* The file 'classifications' contains the class labels for the papers.

* The directory `extractions' contains the extracted authors, title,
abstract, etc., plus the references (and in some cases surrounding
text) from the postscript papers we found on the Web.


PAPERS

The file `papers' lists all of the papers for which we have postscript
files. Three fields, tab separated:

<id> <filename> <citation string>

There are about 52000 lines in this file, but a number of papers
have more than one postscript file. If you eliminate lines with
duplicate ids there are about 37000 papers. Note the citation string
is either (1) an arbitrary bibliography reference to the paper, if one
was made, or (2) a constructed entry based on the authors and title
extracted from the postscript file.


CITATIONS

The file 'citations' has the citation graph. Two fields, tab
separated:

<referring_id> <cited_id>

The referring_id is the id of the paper that has the bibliography
section (always one we have postscript for). The cited_id is the
paper referenced (we may or may not have postscript for it). There
are about 715000 citations.
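
A matching sketch for the citation graph, with the same assumptions about path and encoding; `load_citations` is an illustrative name, not part of the distribution:

```python
from collections import defaultdict

def load_citations(path="citations"):
    """Read `citations`: <referring_id> <cited_id>, tab-separated, and build
    an adjacency list from each referring paper to the papers it cites."""
    cites = defaultdict(list)
    with open(path, encoding="latin-1") as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue
            referring_id, cited_id = fields
            cites[referring_id].append(cited_id)
    return cites

cites = load_citations()
print(sum(len(v) for v in cites.values()), "citation edges")  # ~715000 per the readme
```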


CITATIONS.WITHAUTHORS

The file 'citations.withauthors' contains another copy of the
citation graph. This time we have also included authors and file
names of each paper in addition to each papers' unique paper_id and
the paper_id's of the references they make. The format of this file
is:

***
this_paper_id
filename
id_of_first_cited_paper
id_of_second_cited_paper
.
.
.
*
Author#1 (of this paper)
Author#2
.
.
.
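
A sketch of a record-oriented parser for the layout above, assuming each record opens with a `***` line, a single `*` line separates the cited ids from the authors, and the author list runs until the next `***` or end of file; field order follows the layout, but check it against the real file:

```python
def load_citations_with_authors(path="citations.withauthors"):
    """Parse `citations.withauthors` records into dicts with keys
    id, filename, cites and authors."""
    records, rec, in_authors = [], None, False
    with open(path, encoding="latin-1") as f:
        for raw in f:
            line = raw.strip()
            if line == "***":                      # a new record starts
                if rec is not None:
                    records.append(rec)
                rec = {"id": None, "filename": None, "cites": [], "authors": []}
                in_authors = False
            elif line == "*":                      # switch from cited ids to authors
                in_authors = True
            elif rec is not None and line:
                if rec["id"] is None:
                    rec["id"] = line
                elif rec["filename"] is None:
                    rec["filename"] = line
                elif in_authors:
                    rec["authors"].append(line)
                else:
                    rec["cites"].append(line)
    if rec is not None:
        records.append(rec)
    return records
```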

CLASSIFICATIONS

The file `classifications' contains the research topic classifications
for each of the files. The format of the file is:
"filename"+"\t"+"classification". For example:

http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps /Information_Retrieval/Retrieval/


The file name is the URL the paper came from, translated to a file
name by changing / to #. The classification is the label name in the
Cora directory hierarchy.

Note that the class labels were not perfectly assigned.
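
A reading sketch, assuming one tab-separated "filename<TAB>classification" pair per line and the same encoding guess as above:

```python
def load_classifications(path="classifications"):
    """Read `classifications` into {filename: label}.  Filenames are the
    source URL with every '/' replaced by '#'."""
    labels = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2 or not parts[1]:
                continue  # tolerate lines without a label
            filename, label = parts
            labels[filename] = label
    return labels

def filename_to_url(filename):
    """Undo the / -> # translation described above."""
    return filename.replace("#", "/")

labels = load_classifications()
key = "http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps"
print(filename_to_url(key), labels.get(key))
```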


EXTRACTIONS

The directory 'extractions' contains 52906 files, one for each
postscript paper that we found on the Web. The directory contains so
many files that you probably don't want to 'ls' it. Commands like
`find extractions -print' will probably work more efficiently.

Each filename in the 'papers' file should have a file here. I believe
there are also some (perhaps many?) extra files in this tarball that
are not in paper-data that you can just ignore.

Each line of each file corresponds to some bit of data about the
postscript file. Most of the MIME-like field tags are
straightforward and self-explanatory. A few notes:

The fields URL, Refering-URL, and Root-URL are given by the spider.
All other fields are extracted automatically from the text, some by
hand-coded regular expressions and some by an HMM information
extractor.

The fields Abstract-found and Intro-found are binary valued indicators
of whether Abstract and/or Introduction sections were found by some
regular expression matching in the paper.

Each Reference field is one bibliography entry found at the end of the
paper. Note they are marked up using SGML-like tags. Each Reference
field is optionally followed by one (and possibly more?)
Reference-context fields that are snippets of the postscript file
around where the reference was cited.
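
A sketch for pulling the MIME-like fields out of a single extraction file, assuming one "Field: value" pair per line and treating Reference and Reference-context as repeatable fields; other field names and any multi-line values may need special handling:

```python
import os

def parse_extraction(path):
    """Parse one file from `extractions/` into a dict of field values;
    repeatable fields are collected into lists."""
    repeatable = {"Reference", "Reference-context"}
    fields = {name: [] for name in repeatable}
    with open(path, encoding="latin-1", errors="replace") as f:
        for line in f:
            if ":" not in line:
                continue  # not a tagged line
            tag, _, value = line.partition(":")
            tag, value = tag.strip(), value.strip()
            if tag in repeatable:
                fields[tag].append(value)
            else:
                fields[tag] = value
    return fields

# Walk the directory instead of running `ls` on it, as the readme suggests.
for root, _, names in os.walk("extractions"):
    for name in names[:1]:          # just peek at one file per directory
        record = parse_extraction(os.path.join(root, name))
        print(record.get("Title"), len(record["Reference"]), "references")
```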

### Answer 1: GCN (Graph Convolutional Network) is a deep learning model for graph-structured data, and the Cora dataset is a common benchmark for studying GCN performance.

The version of Cora used for this purpose was popularized by the paper "Revisiting Semi-Supervised Learning with Graph Embeddings" for research on semi-supervised learning with graph embeddings. It contains a citation network in which each node is an academic paper and each edge is a citation between two papers. Every paper carries a 1433-dimensional feature vector, a bag-of-words representation built from the paper's text.

The papers fall into 7 classes (Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory). The graph has 2708 nodes (papers); in the standard split only 140 of them carry training labels, so Cora is widely used for graph-based semi-supervised learning.

A GCN takes Cora's adjacency matrix and feature matrix as input. By performing convolution over the adjacency matrix and combining it with the node features, the model learns both the relations between nodes and the nodes' own attributes, and predicts labels for the unlabeled nodes. Researchers use Cora to validate their GCN variants on this semi-supervised task, and the dataset also supports other citation-network studies such as node classification and link prediction. In short, Cora provides a standard dataset for validating GCN models and for citation-network research, and it has helped drive progress in graph neural networks.

### Answer 2: GCN is a deep learning model for graph data that learns node representations together with the graph structure, and Cora is a common graph dataset for evaluating and comparing graph-learning algorithms.

Cora contains a citation network of 2708 scientific papers divided into 7 classes, each corresponding to a research area. Nodes represent papers and edges represent citations between them; the node features encode word occurrences in the papers, and the edges are treated as undirected.

To train a GCN on Cora, the graph structure is first turned into an adjacency matrix, in which each entry records whether two nodes are connected, and each node is given an initial feature vector. The GCN then learns node representations through stacked graph-convolution layers. During training, forward and backward passes update the weights so that the model predicts each node's class as accurately as possible; over repeated iterations it gradually improves its representations of the nodes and of the graph structure.

The usual evaluation is node classification: given a node, predict its class. The data are split into training, validation and test sets; the validation set is used to tune hyperparameters and the test set to measure generalization. Overall, GCN is a powerful tool for graph data, and applying it to Cora helps in understanding and analyzing the relations in a citation network.

### Answer 3: GCN is a deep learning model for graph data, and Cora is one of the datasets most commonly used to train and evaluate such graph neural networks.

Cora consists of features extracted from 2708 research papers grouped into seven research-area classes (it comes from the collection assembled by Andrew McCallum's group, linked above). The citation relations between the papers form the graph structure: the edges describe how knowledge propagates and interacts between papers.

In a GCN, each node stands for one paper and each edge for a citation. Every node carries a feature vector describing the paper's content. The GCN aggregates information from a node's neighbours through graph-convolution operations to learn feature representations of the nodes.

Training a GCN requires splitting Cora into training, validation and test sets; a common split uses 140 nodes for training, 500 for validation, and the remaining nodes (1000 of them in the standard Planetoid split) for testing. During training the model back-propagates and optimizes its parameters on the training labels to shrink the gap between predicted and true labels.

By training a GCN and evaluating it on Cora we can measure its performance on this classification task: accuracy (or another metric) on the test set shows how well the model predicts the classes of papers it has not seen. In practice, GCN models and citation graphs like Cora are also used for many other graph tasks, such as social-network analysis and recommender systems.
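
For concreteness, here is a minimal NumPy sketch of the renormalized propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) that a GCN applies to Cora's adjacency and feature matrices; the toy random graph, layer sizes and two-layer stacking are illustrative stand-ins, not the actual Cora data or a full training loop:

```python
import numpy as np

def gcn_layer(adj, h, w, relu=True):
    """One graph-convolution layer: normalize A+I symmetrically, then propagate."""
    a_hat = adj + np.eye(adj.shape[0])                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))        # D^-1/2 as a vector
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    out = a_norm @ h @ w
    return np.maximum(out, 0.0) if relu else out

rng = np.random.default_rng(0)
n = 5                                                    # Cora would have n = 2708
adj = (rng.random((n, n)) > 0.7).astype(float)
adj = np.maximum(adj, adj.T)                             # undirected citation edges
np.fill_diagonal(adj, 0.0)
features = rng.random((n, 1433))                         # Cora's 1433-dim bag of words
w1, w2 = rng.normal(size=(1433, 16)), rng.normal(size=(16, 7))
hidden = gcn_layer(adj, features, w1)                    # hidden layer
logits = gcn_layer(adj, hidden, w2, relu=False)          # scores for the 7 classes
print(logits.shape)                                      # (5, 7)
```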