Exploring the public ArXiv dataset with Neo4j
All scientists know the famous website ArXiv, which makes accessible over 1.7 million scientific papers in the fields of mathematics, physics, computer science, and economics (and the list is not exhaustive!).
Recently, Cornell University, which has been managing ArXiv for 30 years, released a dataset containing all the articles of the platform in the public domain. Information about this dataset can be found here: https://www.kaggle.com/Cornell-University/arxiv (see also the introduction blog post here).
The dataset uploaded to Kaggle contains metadata for each article (DOI, title, authors, categories, abstract…). Although the full PDFs are also accessible, we will only use this metadata file in this post.
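The metadata file is distributed as JSON lines (one JSON object per article). As a minimal sketch, here is how a single record can be parsed; the field names follow the Kaggle dataset description, and the sample record below is illustrative, not taken from the dataset:

```python
import json

# One record, as it would appear on a single line of the metadata file.
# Field names ("id", "title", "categories", "doi") follow the Kaggle description.
sample_line = (
    '{"id": "0704.0001", '
    '"title": "Calculation of prompt diphoton production", '
    '"categories": "hep-ph", '
    '"doi": "10.1103/PhysRevD.76.013009"}'
)

record = json.loads(sample_line)
print(record["id"], "|", record["categories"])

# In practice, the full file is streamed line by line to avoid loading
# everything into memory:
# with open("arxiv-metadata-oai-snapshot.json") as f:
#     for line in f:
#         record = json.loads(line)
```

Streaming line by line matters here: the metadata file is over a gigabyte, so parsing it as one big JSON document would be wasteful.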
In this post, we will go through this dataset and import the data into Neo4j for further analysis. The steps we are going to follow are:
- Import the data into Neo4j using the Neo4j import tool
- Simple data analysis
Data import
Since the dataset is quite large, we will import it using the Neo4j import tool, which is super fast. It processes input CSV files containing the node and relationship definitions. Some parsing and formatting is needed beforehand for the data to be understood by this tool, so we are going to:
- Data parsing: read the raw data
- Data cleaning: remove duplicates
- Data formatting: format the data for Neo4j
- And finally, import the data
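The formatting and import steps above can be sketched as follows. The Neo4j import tool expects CSV headers using reserved fields such as `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE`; the file names and properties below are illustrative choices, not the actual schema we will build:

```python
import csv

# Nodes file for articles: ':ID' uniquely identifies each node,
# ':LABEL' sets the node label.
with open("articles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["articleId:ID", "title", ":LABEL"])
    writer.writerow(["0704.0001", "Calculation of prompt diphoton production", "Article"])

# Nodes file for authors.
with open("authors.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["authorId:ID", "name", ":LABEL"])
    writer.writerow(["nason_p", "P. Nason", "Author"])

# Relationships file: ':START_ID' and ':END_ID' reference node IDs above,
# ':TYPE' names the relationship.
with open("authored.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    writer.writerow(["nason_p", "0704.0001", "AUTHORED"])

# The import itself is then run from the command line, with the database
# stopped, with something roughly like:
#   neo4j-admin import --nodes=articles.csv --nodes=authors.csv \
#       --relationships=authored.csv
```

Note that the import tool builds a database from scratch: it is meant for an initial bulk load, not for adding data to an existing database.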