Academic Frontier Trend Analysis

miaochangq

Tags: python, big data, machine learning

Article link: https://blog.csdn.net/miaochangq/article/details/112477255

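Task 1: Paper Data Statistics

Task description

Task topic: count the number of computer science papers in each subfield for the whole of 2019
Task content: understand the problem, then read the data with Pandas and compute the statistics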

Dataset

Data source: https://www.kaggle.com/Cornell-University/arxiv

wget https://cdn.coggle.club/arxiv-metadata-oai-2019.json.zip

The dataset has the following fields:

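Field	Description
id	arXiv ID, which can be used to access the paper
submitter	Person who submitted the paper
authors	Authors of the paper
title	Title of the paper
comments	Page count, number of figures, and other remarks
journal-ref	Information about the journal in which the paper was published
doi	Digital Object Identifier, https://www.doi.org
report-no	Report number
categories	Categories or tags of the paper in the arXiv system
license	License of the paper
abstract	Abstract of the paper
versions	Version history of the paper
update_date	Date the record was last updated
authors_parsed	Parsed author information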

Sample record

{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report-no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": null,
  "abstract": "  A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.",
  "versions": [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
  ],
  "update_date": "2008-11-26",
  "authors_parsed": [
    ["Balázs", "C.", ""],
    ["Berger", "E. L.", ""],
    ["Nadolsky", "P. M.", ""],
    ["Yuan", "C. -P.", ""]
  ]
}

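arXiv paper categories

The category names and their descriptions come from the arXiv website: see the Subject Classifications part of section 5.3 of https://arxiv.org/help/api/user-manual, or https://arxiv.org/category_taxonomy. A portion of the 153 paper categories is listed below: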

'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',

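Code implementation and official walkthrough

Import the required packages

# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing the HTML scraped from arXiv
import re  # regular expressions, for matching string patterns
import requests  # sending HTTP requests to fetch pages by URL
import json  # reading the data, which is stored as JSON
import pandas as pd  # data analysis
import matplotlib.pyplot as plt  # plotting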

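# Read the data
data = []  # start with an empty list
# Advantage of the with statement: the file handle is closed automatically, even if an exception is raised while reading
with open("arxiv-metadata-oai-2019.json",'r') as f:
    for line in f:
        data.append(json.loads(line))  # each line is one JSON object

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas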
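The file stores one JSON object per line, which is why the loop above calls json.loads on each line instead of calling json.load on the whole file. The following is a minimal sketch of the same pattern, assuming an in-memory sample in place of the real file; the two records and the `keep` tuple are made up for illustration, and keeping only a few fields is an optional way to reduce memory use.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of arxiv-metadata-oai-2019.json.
sample = io.StringIO(
    '{"id": "0001", "categories": "cs.CV", "update_date": "2019-08-19", "abstract": "..."}\n'
    '{"id": "0002", "categories": "cs.AI cs.LG", "update_date": "2019-10-21", "abstract": "..."}\n'
)

keep = ("id", "categories", "update_date")  # hypothetical subset of fields

rows = []
for line in sample:
    record = json.loads(line)                        # parse one JSON object per line
    rows.append({key: record[key] for key in keep})  # drop the fields that are not needed

df = pd.DataFrame(rows)
print(df.shape)
```

pandas can also read this layout directly with pd.read_json(path, lines=True), which may be more convenient when every field is wanted.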

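JSON functions

Function	Description
json.dumps	Encode a Python object as a JSON string
json.dump	Encode a Python object as JSON and write it to a file object
json.loads	Decode a JSON string into a Python object
json.load	Read JSON from a file object and decode it into a Python object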
json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)

json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)

json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
# dumps and loads convert in memory; dump and load add one step:
# dump writes the serialized string to a file, and load reads it back from a file

# Example of dump and load
import json
d1 = {'name': 'foot'}
# Serialize d1 and write the resulting string to the file named db
with open('db', 'w') as f:
    json.dump(d1, f)
with open('db', 'r') as f:
    d1 = json.load(f)
print(d1, type(d1))

{'name': 'foot'} <class 'dict'>

JSON to Python type conversion

JSON	Python
object	dict
array	list
string	str
number (int)	int
number (real)	float
true	True
false	False
null	None

Python to JSON type conversion

Python	JSON
dict	object
list, tuple	array
str	string
int, float	number
True	true
False	false
None	null
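The two conversion tables can be checked with a short round trip through json.dumps and json.loads. This is only an illustrative sketch; the dictionary contents are made up.

```python
import json

# A made-up object that exercises several rows of the conversion tables.
original = {"name": "foot", "sizes": (1, 2), "ratio": 0.5, "ok": True, "extra": None}

text = json.dumps(original)   # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored)
```

The tuple comes back as a list, because JSON has a single array type, so a round trip does not always return an object identical to the one that went in.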

data.shape  # show the size of the data as (rows, columns)

(170618, 14)

data.head()

	id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	update_date	authors_parsed
0	0704.0297	Sung-Chul Yoon	Sung-Chul Yoon, Philipp Podsiadlowski and Step...	Remnant evolution after a carbon-oxygen white ...	15 pages, 15 figures, 3 tables, submitted to M...	None	10.1111/j.1365-2966.2007.12161.x	None	astro-ph	None	We systematically explore the evolution of t...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1	0704.0342	Patrice Ntumba Pungu	B. Dugmore and PP. Ntumba	Cofibrations in the Category of Frolicher Spac...	27 pages	None	None	None	math.AT	None	Cofibrations are defined in the category of ...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Dugmore, B., ], [Ntumba, PP., ]]
2	0704.0360	Zaqarashvili	T.V. Zaqarashvili and K Murawski	Torsional oscillations of longitudinally inhom...	6 pages, 3 figures, accepted in A&A	None	10.1051/0004-6361:20077246	None	astro-ph	None	We explore the effect of an inhomogeneous ma...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Zaqarashvili, T. V., ], [Murawski, K, ]]
3	0704.0525	Sezgin Ayg\"un	Sezgin Aygun, Ismail Tarhan, Husnu Baysal	On the Energy-Momentum Problem in Static Einst...	This submission has been withdrawn by arXiv ad...	Chin.Phys.Lett.24:355-358,2007	10.1088/0256-307X/24/2/015	None	gr-qc	None	This paper has been removed by arXiv adminis...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	2019-10-21	[[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4	0704.0535	Antonio Pipino	Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...	The Formation of Globular Cluster Systems in M...	32 pages (referee format), 9 figures, ApJ acce...	Astrophys.J.665:295-305,2007	10.1086/519546	None	astro-ph	None	The most massive elliptical galaxies show a ...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	2019-08-19	[[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...

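Data preprocessing

First, take a rough look at the category information with describe(), which reports:

count: the number of elements in the column
unique: the number of distinct values in the column
top: the most frequent value in the column
freq: the number of times the most frequent value occurs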

data['categories'].describe()

count     170618
unique     15592
top        cs.CV
freq        5559
Name: categories, dtype: object

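The result shows 170618 records and 15592 distinct values of the categories field (each value is a single category or a combination of categories). The most frequent value is cs.CV, which occurs 5559 times.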
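To make the four describe() fields concrete, here is a small sketch on made-up category strings. Note that a string listing two categories counts as its own distinct value.

```python
import pandas as pd

# Made-up category strings; the third paper carries two labels.
s = pd.Series(["cs.CV", "cs.CV", "cs.AI math.ST", "hep-ph"], name="categories")

summary = s.describe()
print(summary)
```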

Some papers belong to more than one category, so the next step is to determine how many distinct individual categories appear in this dataset.

unique_categories = set([i for l in [x.split(' ') for x in data['categories']] for i in l])
print(len(unique_categories))
unique_categories

172





{'acc-phys',
 'adap-org',
 'alg-geom',
 'astro-ph',
 'astro-ph.CO',
 'astro-ph.EP',
 'astro-ph.GA',
 'astro-ph.HE',
 'astro-ph.IM',
 'astro-ph.SR',
 'chao-dyn',
 'chem-ph',
 'cmp-lg',
 'comp-gas',
 'cond-mat',
 'cond-mat.dis-nn',
 'cond-mat.mes-hall',
 'cond-mat.mtrl-sci',
 'cond-mat.other',
 'cond-mat.quant-gas',
 'cond-mat.soft',
 'cond-mat.stat-mech',
 'cond-mat.str-el',
 'cond-mat.supr-con',
 'cs.AI',
 'cs.AR',
 'cs.CC',
 'cs.CE',
 'cs.CG',
 'cs.CL',
 'cs.CR',
 'cs.CV',
 'cs.CY',
 'cs.DB',
 'cs.DC',
 'cs.DL',
 'cs.DM',
 'cs.DS',
 'cs.ET',
 'cs.FL',
 'cs.GL',
 'cs.GR',
 'cs.GT',
 'cs.HC',
 'cs.IR',
 'cs.IT',
 'cs.LG',
 'cs.LO',
 'cs.MA',
 'cs.MM',
 'cs.MS',
 'cs.NA',
 'cs.NE',
 'cs.NI',
 'cs.OH',
 'cs.OS',
 'cs.PF',
 'cs.PL',
 'cs.RO',
 'cs.SC',
 'cs.SD',
 'cs.SE',
 'cs.SI',
 'cs.SY',
 'dg-ga',
 'econ.EM',
 'econ.GN',
 'econ.TH',
 'eess.AS',
 'eess.IV',
 'eess.SP',
 'eess.SY',
 'funct-an',
 'gr-qc',
 'hep-ex',
 'hep-lat',
 'hep-ph',
 'hep-th',
 'math-ph',
 'math.AC',
 'math.AG',
 'math.AP',
 'math.AT',
 'math.CA',
 'math.CO',
 'math.CT',
 'math.CV',
 'math.DG',
 'math.DS',
 'math.FA',
 'math.GM',
 'math.GN',
 'math.GR',
 'math.GT',
 'math.HO',
 'math.IT',
 'math.KT',
 'math.LO',
 'math.MG',
 'math.MP',
 'math.NA',
 'math.NT',
 'math.OA',
 'math.OC',
 'math.PR',
 'math.QA',
 'math.RA',
 'math.RT',
 'math.SG',
 'math.SP',
 'math.ST',
 'mtrl-th',
 'nlin.AO',
 'nlin.CD',
 'nlin.CG',
 'nlin.PS',
 'nlin.SI',
 'nucl-ex',
 'nucl-th',
 'patt-sol',
 'physics.acc-ph',
 'physics.ao-ph',
 'physics.app-ph',
 'physics.atm-clus',
 'physics.atom-ph',
 'physics.bio-ph',
 'physics.chem-ph',
 'physics.class-ph',
 'physics.comp-ph',
 'physics.data-an',
 'physics.ed-ph',
 'physics.flu-dyn',
 'physics.gen-ph',
 'physics.geo-ph',
 'physics.hist-ph',
 'physics.ins-det',
 'physics.med-ph',
 'physics.optics',
 'physics.plasm-ph',
 'physics.pop-ph',
 'physics.soc-ph',
 'physics.space-ph',
 'q-alg',
 'q-bio',
 'q-bio.BM',
 'q-bio.CB',
 'q-bio.GN',
 'q-bio.MN',
 'q-bio.NC',
 'q-bio.OT',
 'q-bio.PE',
 'q-bio.QM',
 'q-bio.SC',
 'q-bio.TO',
 'q-fin.CP',
 'q-fin.EC',
 'q-fin.GN',
 'q-fin.MF',
 'q-fin.PM',
 'q-fin.PR',
 'q-fin.RM',
 'q-fin.ST',
 'q-fin.TR',
 'quant-ph',
 'solv-int',
 'stat.AP',
 'stat.CO',
 'stat.ME',
 'stat.ML',
 'stat.OT',
 'stat.TH',
 'supr-con'}

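There are 172 distinct paper categories in total.

[i for l in [x.split(' ') for x in data['categories']] for i in l]

The list comprehension above deserves attention: its for clauses run in the order in which they appear, from left to right, so the first for is the outer loop.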
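The evaluation order can be seen by writing the comprehension next to equivalent explicit loops. This is a small sketch on made-up category strings, not on the real data.

```python
# Made-up category strings in the same space-separated layout as data['categories'].
categories = ["cs.CV cs.LG", "hep-ph", "cs.LG stat.ML"]

# Nested comprehension in the same form as above: the first for clause is the outer loop.
flat = [i for l in [x.split(' ') for x in categories] for i in l]

# Equivalent explicit loops, written in the same outer-to-inner order.
flat_loop = []
for x in categories:
    for i in x.split(' '):
        flat_loop.append(i)

print(flat)
print(len(set(flat)))  # number of distinct categories
```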

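The task is to analyze papers from 2019 onward, so the time feature is preprocessed first in order to obtain the papers of all categories from 2019 onward.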

data.columns

Index(['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],
      dtype='object')

The update_date column of data holds dates, so it is the one processed here.

data['year'] = pd.to_datetime(data['update_date']).dt.year
# Convert update_date to datetime and extract the year into a new year column
data_2019 = data[data['year']>=2019].reset_index()
data_2019.head()

	index	id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	update_date	authors_parsed	year
0	0	0704.0297	Sung-Chul Yoon	Sung-Chul Yoon, Philipp Podsiadlowski and Step...	Remnant evolution after a carbon-oxygen white ...	15 pages, 15 figures, 3 tables, submitted to M...	None	10.1111/j.1365-2966.2007.12161.x	None	astro-ph	None	We systematically explore the evolution of t...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...	2019
1	1	0704.0342	Patrice Ntumba Pungu	B. Dugmore and PP. Ntumba	Cofibrations in the Category of Frolicher Spac...	27 pages	None	None	None	math.AT	None	Cofibrations are defined in the category of ...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Dugmore, B., ], [Ntumba, PP., ]]	2019
2	2	0704.0360	Zaqarashvili	T.V. Zaqarashvili and K Murawski	Torsional oscillations of longitudinally inhom...	6 pages, 3 figures, accepted in A&A	None	10.1051/0004-6361:20077246	None	astro-ph	None	We explore the effect of an inhomogeneous ma...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	2019-08-19	[[Zaqarashvili, T. V., ], [Murawski, K, ]]	2019
3	3	0704.0525	Sezgin Ayg\"un	Sezgin Aygun, Ismail Tarhan, Husnu Baysal	On the Energy-Momentum Problem in Static Einst...	This submission has been withdrawn by arXiv ad...	Chin.Phys.Lett.24:355-358,2007	10.1088/0256-307X/24/2/015	None	gr-qc	None	This paper has been removed by arXiv adminis...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	2019-10-21	[[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...	2019
4	4	0704.0535	Antonio Pipino	Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...	The Formation of Globular Cluster Systems in M...	32 pages (referee format), 9 figures, ApJ acce...	Astrophys.J.665:295-305,2007	10.1086/519546	None	astro-ph	None	The most massive elliptical galaxies show a ...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	2019-08-19	[[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...	2019
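The year extraction and filter can be tried on a toy frame. In this sketch the rows are made up, and reset_index(drop=True) is used as a variation: the article's reset_index() without drop keeps the old index as the extra index column visible in the table above.

```python
import pandas as pd

# Made-up rows; only update_date matters here.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "update_date": ["2008-11-26", "2019-08-19", "2020-01-05"],
})

df["year"] = pd.to_datetime(df["update_date"]).dt.year  # parse the strings, keep the year
recent = df[df["year"] >= 2019].reset_index(drop=True)  # drop=True avoids an extra 'index' column

print(recent)
```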

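This gives all papers whose update_date falls in 2019 or later, which includes older submissions that were updated in 2019, such as the 0704.xxxx records above. The next step is to pick out the articles in the computer science field: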

website_url = requests.get('https://arxiv.org/category_taxonomy').text
# Fetch the page as text
soup = BeautifulSoup(website_url,'lxml')  # parse the page with the lxml parser
root = soup.find('div',{'id':'category_taxonomy_list'})
# Locate the element that contains the taxonomy list
tags = root.find_all(['h2','h3','h4','p'],recursive=True)

[Figures: four screenshots of the crawler analysis process (爬虫1.png, 爬虫2.png, 爬虫2-1.png, 爬虫3.png); images not available]
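The parsing step can be tried offline on a small HTML fragment. The fragment below is hand-written to imitate the structure described in the code comments of this article; it is an assumption, not a copy of the live page, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the taxonomy page structure (an assumption).
html = """
<div id="category_taxonomy_list">
  <h2 class="accordion-head">Statistics</h2>
  <h4>stat.TH <span>(Statistics Theory)</span></h4>
  <p>stat.TH is an alias for math.ST.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
root = soup.find("div", {"id": "category_taxonomy_list"})
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)

# find_all returns the matching tags in document order.
for t in tags:
    print(t.name, "->", t.text)
```

Because the tags arrive in document order, a single pass can remember the most recent h2, h3, and h4 and attach them to each p that follows.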

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
for t in tags:
    if t.name == "h2":  # t.name is the tag name, e.g. 'h2' or 'h3'
        # The h2 tag looks like <h2 class="accordion-head">Mathematics</h2>; only the text "Mathematics" is needed
        level_1_name = t.text  # t.text is the text content with the markup stripped
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text  # <h3>Quantum Physics<br/><span>(quant-ph)</span></h3> gives t.text 'Quantum Physics(quant-ph)'
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw)  # pattern: (.*)\((.*)\); replacement: \2; input string: raw
        # The pattern captures the text before the parentheses and the text inside them; r"\2" keeps the second group
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text  # <h4>stat.TH <span>(Statistics Theory)</span></h4> gives t.text 'stat.TH (Statistics Theory)'
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        # e.g. <p>stat.TH is an alias for math.ST. Asymptotics, Bayesian Inference, Decision Theory, Estimation, Foundations, Inference, Testing.</p>
        level_1_names.append(level_1_name)  # set earlier while handling h2, h3, and h4
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the collected information
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
})

# Note: groupby returns a GroupBy object and does not sort or modify df_taxonomy;
# its result is not assigned here, so this line has no effect on the table displayed below
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy

	group_name	archive_name	archive_id	category_name	categories	category_description
0	Computer Science	Computer Science	Computer Science	Artificial Intelligence	cs.AI	Covers all areas of AI except Vision, Robotics...
1	Computer Science	Computer Science	Computer Science	Hardware Architecture	cs.AR	Covers systems organization and hardware archi...
2	Computer Science	Computer Science	Computer Science	Computational Complexity	cs.CC	Covers models of computation, complexity class...
3	Computer Science	Computer Science	Computer Science	Computational Engineering, Finance, and Science	cs.CE	Covers applications of computer science to the...
4	Computer Science	Computer Science	Computer Science	Computational Geometry	cs.CG	Roughly includes material in ACM Subject Class...
...	...	...	...	...	...	...
150	Statistics	Statistics	Statistics	Computation	stat.CO	Algorithms, Simulation, Visualization
151	Statistics	Statistics	Statistics	Methodology	stat.ME	Design, Surveys, Model Selection, Multiple Tes...
152	Statistics	Statistics	Statistics	Machine Learning	stat.ML	Covers machine learning papers (supervised, un...
153	Statistics	Statistics	Statistics	Other Statistics	stat.OT	Work in statistics that does not fit into the ...
154	Statistics	Statistics	Statistics	Statistics Theory	stat.TH	stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns

The regular expressions used in the code deserve a short explanation:

Signature: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the Match object and must return
a replacement string to be used.

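pattern: the regular expression pattern string.
repl: the replacement string; it may also be a function.
string: the original string to search and replace in.
count: the maximum number of replacements after matching; the default 0 replaces all matches.
flags: matching flags used when compiling the pattern, given as a number (for example re.IGNORECASE).
pattern, repl, and string are required arguments.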

import re
phone = "2004-959-559 # a phone number"
# Remove the comment
num = re.sub(r"#.*$","",phone)
print("Phone number:",num)
# Remove everything that is not a digit
num = re.sub(r'\D','',phone)
print("Phone number:",num)

Phone number: 2004-959-559 
Phone number: 2004959559
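The taxonomy parsing earlier in this article uses re.sub with group references rather than plain deletion. A minimal sketch of that usage, applied to the two example strings quoted in the parsing code's comments (the variable names here are illustrative):

```python
import re

h3_text = "Quantum Physics(quant-ph)"    # text of an <h3>, as quoted in the parsing code
h4_text = "stat.TH (Statistics Theory)"  # text of an <h4>

# Each pattern matches the whole string, so re.sub replaces it with one captured group.
archive_id = re.sub(r"(.*)\((.*)\)", r"\2", h3_text)
archive_name = re.sub(r"(.*)\((.*)\)", r"\1", h3_text)
category_code = re.sub(r"(.*) \((.*)\)", r"\1", h4_text)
category_name = re.sub(r"(.*) \((.*)\)", r"\2", h4_text)

print(archive_id, archive_name, category_code, category_name, sep=" | ")
```

If a string contains no parentheses, the pattern does not match and re.sub returns the string unchanged, which is worth keeping in mind when the page layout changes.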

Data analysis and visualization

Next, look at how the number of papers is distributed across the top-level groups.

_df = (
    data_2019.merge(df_taxonomy, on='categories', how='left')
    .drop_duplicates(['id', 'group_name'])
    .groupby('group_name')
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)
# groupby('group_name').agg({"id": "count"}) is equivalent to .groupby('group_name').count()[['id']]
_df

	group_name	id
0	Physics	38379
1	Mathematics	24495
2	Computer Science	18087
3	Statistics	1802
4	Electrical Engineering and Systems Science	1371
5	Quantitative Biology	886
6	Quantitative Finance	352
7	Economics	173
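The merge, de-duplication, and count chain can be followed step by step on a toy example. The two small frames below are made up. Note that the merge compares the whole categories string, so a paper whose string lists several categories finds no match in the taxonomy and is left out of the counts.

```python
import pandas as pd

papers = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "categories": ["cs.CV", "cs.AI", "math.ST", "cs.CV cs.LG"],  # p4 has two labels
})
taxonomy = pd.DataFrame({
    "categories": ["cs.CV", "cs.AI", "math.ST"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics"],
})

counts = (
    papers.merge(taxonomy, on="categories", how="left")  # unmatched rows get NaN group_name
    .drop_duplicates(["id", "group_name"])               # one row per paper and group
    .groupby("group_name")                               # NaN groups are dropped by default
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)
print(counts)
```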

fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1) 
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()

plt.savefig("./各类论文分布图.png")
plt.show()

[Figure: pie chart of the share of papers in each top-level group (output_28_0.svg); image not available]

Next, count the papers in each computer science subfield from 2019 onward:

group_name="Computer Science"
# Roughly equivalent to the SQL: select d1.*, d2.* from data_2019 d1 join df_taxonomy d2 on d1.categories = d2.categories where group_name = "Computer Science"
cats = data_2019.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")

year	2019
category_name
Artificial Intelligence	558
Computation and Language	2153
Computational Complexity	131
Computational Engineering, Finance, and Science	108
Computational Geometry	199
Computer Science and Game Theory	281
Computer Vision and Pattern Recognition	5559
Computers and Society	346
Cryptography and Security	1067
Data Structures and Algorithms	711
Databases	282
Digital Libraries	125
Discrete Mathematics	84
Distributed, Parallel, and Cluster Computing	715
Emerging Technologies	101
Formal Languages and Automata Theory	152
General Literature	5
Graphics	116
Hardware Architecture	95
Human-Computer Interaction	420
Information Retrieval	245
Logic in Computer Science	470
Machine Learning	177
Mathematical Software	27
Multiagent Systems	85
Multimedia	76
Networking and Internet Architecture	864
Neural and Evolutionary Computing	235
Numerical Analysis	40
Operating Systems	36
Other Computer Science	67
Performance	45
Programming Languages	268
Robotics	917
Social and Information Networks	202
Software Engineering	659
Sound	7
Symbolic Computation	44
Systems and Control	415
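The groupby and pivot step that produces a table of this shape can be tried on a toy frame. The rows below are made up, and a second year is included only to show how pivot spreads years across columns.

```python
import pandas as pd

cats_demo = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "year": [2019, 2019, 2019, 2020],
    "category_name": ["Robotics", "Robotics", "Databases", "Robotics"],
})

table = (
    cats_demo.groupby(["year", "category_name"])["id"].count()  # papers per (year, category)
    .reset_index()                                               # back to ordinary columns
    .pivot(index="category_name", columns="year", values="id")  # one column per year
)
print(table)
```

Combinations that never occur, such as Databases in 2020 here, appear as NaN in the pivoted table.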

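The results show that Computer Vision and Pattern Recognition is the CS subcategory with the most papers, far ahead of every other CS subcategory. Computation and Language (2153), Cryptography and Security (1067), and Robotics (917) also have 2019 paper counts above or close to 1000, which matches our expectations. The table covers only 2019, so it does not by itself show whether the counts are growing from year to year.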
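One caveat about these counts: the merge on categories compares the whole string, so only papers whose categories field is exactly one code are counted (the 5559 for Computer Vision equals the frequency of the bare value cs.CV reported by describe() earlier). A possible alternative, sketched here on made-up rows and not part of the original analysis, is to split the string and give each code its own row before counting.

```python
import pandas as pd

papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.CV", "cs.CV cs.LG", "cs.LG stat.ML"],
})

# Split each string into a list of codes, then give every code its own row.
exploded = papers.assign(categories=papers["categories"].str.split(" ")).explode("categories")

# Count distinct papers per individual category code.
per_category = exploded.groupby("categories")["id"].nunique().sort_values(ascending=False)
print(per_category)
```

With this approach a paper is counted once in every category it lists, so the per-category totals add up to more than the number of papers.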
