Data Analysis - Academic Frontier Trend Analysis - Paper Data Statistics

Paper Data Statistics

Dataset Overview

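Each record in the dataset has the following fields:

id: the arXiv ID, which can be used to access the paper;

submitter: the person who submitted the paper;

authors: the authors of the paper;

title: the paper title;

comments: the page count, number of figures, and other remarks;

journal-ref: information about the journal in which the paper was published;

doi: the Digital Object Identifier, see https://www.doi.org;

report-no: the report number;

categories: the categories (tags) the paper belongs to in the arXiv system;

license: the license of the article;

abstract: the paper abstract;

versions: the version history of the paper;

update_date: the date the record was last updated;

authors_parsed: the parsed author information.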
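To make the field list concrete, here is a minimal sketch of reading a single record. The record is abridged and hand-written from values that appear in the sample output of this post; real lines carry every field. The abstract-page URL pattern is an assumption added for illustration.

```python
import json

# One abridged, hand-written record; real lines carry all the fields listed above.
line = json.dumps({
    "id": "0704.0342",
    "submitter": "Patrice Ntumba Pungu",
    "authors": "B. Dugmore and PP. Ntumba",
    "categories": "math.AT",
    "authors_parsed": [["Dugmore", "B.", ""], ["Ntumba", "PP.", ""]],
})

record = json.loads(line)                       # each line of the file is one JSON object
categories = record["categories"].split(" ")    # several categories are space-separated
last_names = [name[0] for name in record["authors_parsed"]]
url = "https://arxiv.org/abs/" + record["id"]   # assumed URL pattern for the abstract page

print(categories, last_names, url)
```

The categories field is a single string, which is why the later steps split it before counting.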

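Introduction to arXiv Paper Categories

The category names and their descriptions come from the official arXiv site: see the Subject Classifications part of section 5.3 at https://arxiv.org/help/api/user-manual, or https://arxiv.org/category_taxonomy. An excerpt of the 153 paper categories follows: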

'''
'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'''

Implementation and Walkthrough

Import Packages and Read the Raw Data

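# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parses the HTML scraped from arXiv
import re  # regular expressions for matching string patterns
import requests  # sends HTTP requests and fetches pages by URL
import json  # the dataset is stored as JSON
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
# Read the data
data = []

# Advantages of the with statement: 1. the file handle is closed automatically; 2. it is closed even if an exception occurs while reading
with open("arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):  # enumerate() yields (index, item) pairs from an iterable, which is handy in for loops

        # Read only the first 100 lines; loading the whole file needs about 8 GB of memory
        if idx >= 100:
            break

        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data
(100, 14)
data.head()  # show the first five rows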
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Dugmore, B., ], [Ntumba, PP., ]]
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Zaqarashvili, T. V., ], [Murawski, K, ]]
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-10-21 | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-08-19 | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...
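The loading loop above can also be written as a small helper. This is a sketch of one alternative, not the author's code: the helper name read_first_records is hypothetical, and an in-memory io.StringIO with three abridged records stands in for the real file so the example runs anywhere.

```python
import io
import json
from itertools import islice


def read_first_records(lines, limit=100):
    """Parse at most `limit` JSON Lines records from an iterable of text lines."""
    # islice stops after `limit` lines, so the rest of the file is never read.
    return [json.loads(line) for line in islice(lines, limit)]


# Stand-in for the real file: three hand-written, abridged records.
sample = io.StringIO(
    '{"id": "0704.0297", "categories": "astro-ph"}\n'
    '{"id": "0704.0342", "categories": "math.AT"}\n'
    '{"id": "0704.0360", "categories": "astro-ph"}\n'
)
records = read_first_records(sample, limit=2)
print(len(records), records[1]["id"])
```

With the real dataset, the file object returned by open("arxiv-metadata-oai-2019.json") can be passed in place of sample, and the resulting list handed to pd.DataFrame as before.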

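Data Preprocessing

A Rough Count of Paper Categories

count: the number of elements in the column;

unique: the number of distinct values in the column;

top: the most frequent value in the column;

freq: the number of times the most frequent value occurs;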

data["categories"].describe()
count          100
unique          31
top       astro-ph
freq            46
Name: categories, dtype: object

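The result shows 100 records and 31 distinct category strings. This is only a rough count, because a paper can carry several categories: a paper tagged cs.AI cs.MM and a paper tagged cs.AI cs.OS count as two different values. The most frequent value is astro-ph, i.e. Astrophysics, which occurs 46 times.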
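The four statistics are easy to verify on a tiny column. The sketch below uses a hypothetical five-row Series (not the real data) in which two multi-category strings share cs.AI yet still count as separate values.

```python
import pandas as pd

# Hypothetical toy column; each string is one paper's full category list.
categories = pd.Series(
    ["astro-ph", "astro-ph", "cs.AI cs.MM", "cs.AI cs.OS", "math.AT"],
    name="categories",
)

# For a string column, describe() reports count, unique, top and freq.
summary = categories.describe()
print(summary)
```

Here unique is 4 rather than 5 individual tags, because describe() compares whole strings and never looks inside them.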

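Counting the Distinct Categories

Here split breaks each multi-category string on " " (a space) into a list, the for loops collect every individual category, and set removes the duplicates, which leaves all distinct paper categories.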

# All distinct categories
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories
{'astro-ph',
 'cond-mat.mes-hall',
 'cond-mat.str-el',
 'cs.FL',
 'cs.LO',
 'cs.NI',
 'gr-qc',
 'hep-ex',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC',
 'math.AG',
 'math.AT',
 'math.CA',
 'math.CO',
 'math.CV',
 'math.DG',
 'math.DS',
 'math.FA',
 'math.GR',
 'math.LO',
 'math.MP',
 'math.PR',
 'math.RA',
 'math.SG',
 'math.SP',
 'nlin.CD',
 'nucl-ex',
 'physics.acc-ph',
 'physics.class-ph',
 'physics.comp-ph',
 'quant-ph'}
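If the number of papers per individual category is also of interest, the same split can feed a counter instead of a set. This is an alternative sketch using collections.Counter from the standard library, with hand-picked strings standing in for data["categories"].

```python
from collections import Counter

# Hand-picked category strings in the same shape as data["categories"].
category_strings = ["astro-ph", "math.SP math.DG", "math.AT", "astro-ph", "math.DG"]

# Split every string and count each individual category once per paper.
counts = Counter(cat for cats in category_strings for cat in cats.split(" "))

unique_categories = set(counts)   # same members as the set-based approach
print(len(unique_categories), counts["astro-ph"], counts["math.DG"])
```

The keys of the counter are exactly the distinct categories, so the set above comes for free.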

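Python's split() method slices a string on a given separator. If maxsplit is given, at most maxsplit splits are made, so the result has at most maxsplit+1 substrings.

split() syntax:
str.split(sep=None, maxsplit=-1)

Parameters

  • sep – the separator. With the default (None), any run of whitespace separates items, including spaces, newlines (\n), and tabs (\t).
  • maxsplit – the maximum number of splits. The default, -1, means no limit.

Return value: a list of the substrings.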
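A few short calls show how the separator and the split limit change the result. The input strings are made up for illustration.

```python
tags = "math.SP math.DG"

by_space = tags.split(" ")            # explicit single-space separator
explicit = "a b  c".split(" ")        # two spaces in a row yield an empty string
default = "a b  c".split()            # default separator collapses runs of whitespace
limited = "a b c d".split(" ", 1)     # one split at most, so two items

print(by_space, explicit, default, limited)
```

The difference between split(" ") and split() matters when a field may contain repeated spaces; the category strings shown in this post use single spaces.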

The task asks us to analyze papers from 2019 onward, so we first preprocess the time feature and keep only the papers, of every category, from 2019 onward.

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from a str such as 2019-02-20 to datetime and extract the year
# After to_datetime, accessors such as pandas.Series.dt.day or pandas.Series.dt.month return the individual date parts

del data["update_date"]  # drop update_date, it has served its purpose

data = data[data["year"] >= 2019]  # keep the rows whose year is 2019 or later and discard the rest
# data.groupby(['categories','year'])  # group by categories, then by year within each category

data.reset_index(drop=True, inplace=True)  # renumber the index
data  # show the result
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Dugmore, B., ], [Ntumba, PP., ]] | 2019
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Zaqarashvili, T. V., ], [Murawski, K, ]] | 2019
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... | 2019
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... | 2019
...
95 | 0705.3267 | Valeri Makarov | V. V. Makarov and D. W. Murphy | The local stellar velocity field via vector sp... | accepted in AJ | Astron.J.134:367-375,2007 | 10.1086/518242 | None | astro-ph | None | We analyze the local field of stellar tangen... | [{'version': 'v1', 'created': 'Tue, 22 May 200... | [[Makarov, V. V., ], [Murphy, D. W., ]] | 2019
96 | 0705.3638 | Eilat Glikman | Eilat Glikman (1), S. G. Djorgovski (1), Danie... | Discovery of Two Spectroscopically Peculiar, L... | 15 pages, 5 figures, Accepted for publicated i... | None | 10.1086/520085 | None | astro-ph | None | We report the discovery of two low-luminosit... | [{'version': 'v1', 'created': 'Thu, 24 May 200... | [[Glikman, Eilat, , Caltech], [Djorgovski, S. ... | 2019
97 | 0705.3769 | Marc Schumann | Marc Schumann (for the PERKEO II collaboration) | Precision Measurements in Neutron Decay | 6 pages, to appear in the proceedings of the X... | None | None | None | hep-ph | None | We present new precision measurements of ang... | [{'version': 'v1', 'created': 'Fri, 25 May 200... | [[Schumann, Marc, , for the PERKEO II collabor... | 2019
98 | 0705.3804 | Koji Terashi | Koji Terashi (for the CDF and D0 Collaborations) | Exclusive e+e-, Di-photon and Di-jet Productio... | 4 pages, To be submitted to the proceedings of... | None | None | None | hep-ex | None | Results from studies on exclusive production... | [{'version': 'v1', 'created': 'Fri, 25 May 200... | [[Terashi, Koji, , for the CDF and D0 Collabor... | 2019
99 | 0705.3857 | Niels Martin M{\o}ller | Niels Martin Moller | Extremal metrics for spectral functions of Dir... | 45 pages; title and content edited to reflect ... | Adv. Math. 229 (2012), no. 2, 1001--1046. MR28... | 10.1016/j.aim.2011.10.012 | None | math.SP math.DG | http://arxiv.org/licenses/nonexclusive-distrib... | Let (M^n, g) be a closed smooth Riemannian s... | [{'version': 'v1', 'created': 'Fri, 25 May 200... | [[Moller, Niels Martin, ]] | 2019

100 rows × 14 columns
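The year filter above keeps every row here, so its effect is easier to see on a toy frame. The sketch below uses a hypothetical three-row DataFrame in which one record predates 2019.

```python
import pandas as pd

# Hypothetical toy frame; only the update_date column matters here.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "update_date": ["2018-12-31", "2019-08-19", "2019-10-21"],
})

df["year"] = pd.to_datetime(df["update_date"]).dt.year    # str -> datetime -> year
recent = df[df["year"] >= 2019].reset_index(drop=True)    # filter, then renumber from 0

print(recent)
```

Without reset_index the surviving rows would keep their old labels 1 and 2, which is why the original code renumbers after filtering.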

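Because the file we use already contains only 2019 data, the output is the original 100 rows. Next we scrape the full list of categories from arXiv. Note that this step only classifies the papers; the computer science papers are singled out later, after the visualization.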

# Scrape all categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the page as text
soup = BeautifulSoup(website_url, 'lxml')  # parse the HTML; the lxml parser is used here for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # locate the container tag that holds the whole category list
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # collect the tags: h2 (top-level group), h3, h4 (category within a group), p (short description)


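Code Explanation

The get() method of the Requests library:

The most common usage is r = requests.get(url), which builds a Request object that asks the server for a resource.

This object is generated internally by the Requests library.

The returned r is a Response object that holds everything the server sent back.

  • What is a url?

A url is a path for accessing a resource over the HTTP protocol, much like the path of a file on our own computer.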

  • The full form of this function takes three parameters: requests.get(url, params=None, **kwargs)
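The request that get() builds internally can be inspected without touching the network. The following is a minimal sketch using requests.Request and its prepare() method; the query parameter lang is hypothetical and only shows how params ends up in the URL.

```python
import requests

# Build the request without sending it, so the example needs no network access.
request = requests.Request(
    "GET",
    "https://arxiv.org/category_taxonomy",
    params={"lang": "en"},   # hypothetical query parameter, for illustration only
)
prepared = request.prepare()
print(prepared.method, prepared.url)

# Sending it would return a Response object, for example:
# response = requests.get("https://arxiv.org/category_taxonomy", timeout=10)
# response.status_code, response.text
```

The prepared URL shows that params is encoded into the query string, and the commented lines show the attributes of the Response that the scraping code relies on.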
root  # id="category_taxonomy_list" gives the element an id whose value is "category_taxonomy_list"; class="accordion-head" assigns the element to the class "accordion-head"
<div class="large-data-list" id="category_taxonomy_list"><h2 class="accordion-head">Computer Science</h2>
<div class="accordion-body">
<div class="columns"><div class="column">
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.AI <span>(Artificial Intelligence)</span></h4>
</div>
<div class="column"><p>Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.AR <span>(Hardware Architecture)</span></h4>
</div>
<div class="column"><p>Covers systems organization and hardware architecture. Roughly includes material in ACM Subject Classes C.0, C.1, and C.5.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CC <span>(Computational Complexity)</span></h4>
</div>
<div class="column"><p>Covers models of computation, complexity classes, structural complexity, complexity tradeoffs, upper and lower bounds. Roughly includes material in ACM Subject Classes F.1 (computation by abstract devices), F.2.3 (tradeoffs among complexity measures), and F.4.3 (formal languages), although some material in formal languages may be more appropriate for Logic in Computer Science. Some material in F.2.1 and F.2.2, may also be appropriate here, but is more likely to have Data Structures and Algorithms as the primary subject area.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CE <span>(Computational Engineering, Finance, and Science)</span></h4>
</div>
<div class="column"><p>Covers applications of computer science to the mathematical modeling of complex systems in the fields of science, engineering, and finance. Papers here are interdisciplinary and applications-oriented, focusing on techniques and tools that enable challenging computational simulations to be performed, for which the use of supercomputers or distributed computing platforms is often required. Includes material in ACM Subject Classes J.2, J.3, and J.4 (economics).</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CG <span>(Computational Geometry)</span></h4>
</div>
<div class="column"><p>Roughly includes material in ACM Subject Classes I.3.5 and F.2.2.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CL <span>(Computation and Language)</span></h4>
</div>
<div class="column"><p>Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CR <span>(Cryptography and Security)</span></h4>
</div>
<div class="column"><p>Covers all areas of cryptography and security including authentication, public key cryptosytems, proof-carrying code, etc. Roughly includes material in ACM Subject Classes D.4.6 and E.3.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CV <span>(Computer Vision and Pattern Recognition)</span></h4>
</div>
<div class="column"><p>Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.CY <span>(Computers and Society)</span></h4>
</div>
<div class="column"><p>Covers impact of computers on society, computer ethics, information technology and public policy, legal aspects of computing, computers and education. Roughly includes material in ACM Subject Classes K.0, K.2, K.3, K.4, K.5, and K.7.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.DB <span>(Databases)</span></h4>
</div>
<div class="column"><p>Covers database management, datamining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.DC <span>(Distributed, Parallel, and Cluster Computing)</span></h4>
</div>
<div class="column"><p>Covers fault-tolerance, distributed algorithms, stabilility, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.DL <span>(Digital Libraries)</span></h4>
</div>
<div class="column"><p>Covers all aspects of the digital library design and document and text creation. Note that there will be some overlap with Information Retrieval (which is a separate subject area). Roughly includes material in ACM Subject Classes H.3.5, H.3.6, H.3.7, I.7.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.DM <span>(Discrete Mathematics)</span></h4>
</div>
<div class="column"><p>Covers combinatorics, graph theory, applications of probability. Roughly includes material in ACM Subject Classes G.2 and G.3.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.DS <span>(Data Structures and Algorithms)</span></h4>
</div>
<div class="column"><p>Covers data structures and analysis of algorithms. Roughly includes material in ACM Subject Classes E.1, E.2, F.2.1, and F.2.2.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.ET <span>(Emerging Technologies)</span></h4>
</div>
<div class="column"><p>Covers approaches to information processing (computing, communication, sensing) and bio-chemical analysis based on alternatives to silicon CMOS-based technologies, such as nanoscale electronic, photonic, spin-based, superconducting, mechanical, bio-chemical and quantum technologies (this list is not exclusive). Topics of interest include (1) building blocks for emerging technologies, their scalability and adoption in larger systems, including integration with traditional technologies, (2) modeling, design and optimization of novel devices and systems, (3) models of computation, algorithm design and programming for emerging technologies.</p></div>
</div>
<div class="columns divided">
<div class="column is-one-fifth">
<h4>cs.FL <span>(Formal Languages and Automata Theory)</span></h4>
</div>
<div class="column"><p>Covers automata theory, formal language theory, grammars, and combinatorics on words. Th
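Each h4 element in the output above carries text of the form "cs.AI (Artificial Intelligence)". A minimal sketch of splitting such text into a category code and a name with the re module follows. The pattern and the helper name parse_category are illustrative assumptions, not the original author's code.

```python
import re

# Pattern for text such as "cs.AI (Artificial Intelligence)":
# group 1 is the category code, group 2 the human-readable name.
CATEGORY_PATTERN = re.compile(r"^\s*(\S+)\s*\((.*)\)\s*$")


def parse_category(text):
    """Return (code, name) for an h4 text, or None if it does not match."""
    match = CATEGORY_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


pairs = [
    parse_category("cs.AI (Artificial Intelligence)"),
    parse_category("cs.CE (Computational Engineering, Finance, and Science)"),
]
print(pairs)
```

Text without a parenthesized name, such as an h2 group title, does not match and yields None, which makes it easy to tell group headings from individual categories.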