Introduction to Data Analysis (Academic Frontier Trend Analysis) Task 1: Paper Statistics

xyc_undermoon

Article link: https://blog.csdn.net/xyc_undermoon/article/details/112552113

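This competition is a beginner-level introduction to data analysis (academic frontier trend analysis) that uses the public arXiv paper metadata. The tasks cover counting papers, measuring how often authors appear, counting papers that release source code, classifying papers, and modeling the relationships between paper authors.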

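Ⅰ. Data and Background

Topic: counting papers
Content: understand the task, and learn to read data and compute statistics with Pandas
Dataset: arXiv is a major open-access repository of scholarly papers and an important tool for searching, browsing, and downloading them. It covers a very wide range of fields, including the many branches of physics and the sub-disciplines of computer science, as well as mathematics, statistics, electrical engineering, quantitative biology, and economics.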

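Ⅱ. Dataset Overview

Dataset link

Detailed dataset description link

Part of the dataset description is reproduced below.

Dataset format:

id	arXiv ID, which can be used to access the paper
submitter	Submitter of the paper
authors	Authors of the paper
title	Title of the paper
comments	Page count, figure count, and other notes
journal-ref	Information about the journal in which the paper was published
doi	Digital Object Identifier
report-no	Report number
categories	Categories (tags) of the paper in the arXiv system
license	License of the paper
abstract	Abstract of the paper
versions	Version history of the paper
authors_parsed	Parsed author information

Selected paper categories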

astro-ph	Astrophysics
astro-ph.CO	Cosmology and Nongalactic Astrophysics
astro-ph.EP	Earth and Planetary Astrophysics
astro-ph.GA	Astrophysics of Galaxies
cs.AI	Artificial Intelligence
cs.AR	Hardware Architecture
cs.CC	Computational Complexity
cs.CE	Computational Engineering, Finance, and Science
cs.CV	Computer Vision and Pattern Recognition
cs.CY	Computers and Society
cs.DB	Databases
cs.DC	Distributed, Parallel, and Cluster Computing
cs.DL	Digital Libraries
cs.NA	Numerical Analysis
cs.NE	Neural and Evolutionary Computing
cs.NI	Networking and Internet Architecture
cs.OH	Other Computer Science
cs.OS	Operating Systems

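Ⅲ. Code Implementation

Ⅰ. Environment Setup

The required modules are seaborn, BeautifulSoup (from the bs4 package), requests, json, pandas, and matplotlib. The code also uses re from the standard library.

Ⅱ. Data Preprocessing

Read the data: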

import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing the scraped arXiv page
import re  # regular expressions for matching string patterns
import requests  # sending HTTP requests
import json  # the dataset is stored as JSON, one record per line
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting

# Load the data
data = []  # one dict per paper
# A with block closes the file handle automatically, even if an exception is raised while reading
with open(r"D:/xyc/competPractice/dataAnalysis2101/archive/arxiv-metadata-oai-snapshot.json", 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame so it can be analyzed with pandas
print(data.shape)  # show the size of the data
data.head()  # show the first five rows

Result:

(1796911, 14)
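Loading about 1.8 million records with all 14 fields can take a lot of memory. A minimal sketch, assuming only a few columns are needed for the statistics: load_columns is a hypothetical helper (not part of the original code) that reads the file line by line and keeps only the requested keys.

```python
import json
import pandas as pd

def load_columns(path, columns, max_rows=None):
    """Read a JSON-lines file, keeping only the given keys from each record."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_rows is not None and i >= max_rows:
                break  # stop early, handy for a quick look at the data
            record = json.loads(line)
            # dict.get returns None for keys that are missing from a record
            rows.append({col: record.get(col) for col in columns})
    return pd.DataFrame(rows, columns=columns)
```

For this task, something like load_columns(path, ["id", "categories", "update_date"]) would be enough for the category statistics that follow.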

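First, a rough summary of the category information:

'''
count: number of elements in the column;
unique: number of distinct values in the column;
top: the most frequent value in the column;
freq: how many times the most frequent value occurs;
'''

data["categories"].describe()

Result: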

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

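The dataset contains 1,796,911 papers with 62,055 distinct values in the categories column. Each value is the full, space-separated category string of a paper, so this number counts category combinations rather than individual categories. The most frequent value is astro-ph (Astrophysics), which occurs 86,914 times.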
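What describe() counts can be checked on a tiny column. This is only a sketch with made-up values standing in for data["categories"]:

```python
import pandas as pd

# Toy stand-in for data["categories"]: each value is a full category string
categories = pd.Series(
    ["astro-ph", "cs.CV cs.LG", "astro-ph", "math.AT", "cs.CV cs.LG", "astro-ph"]
)

summary = categories.describe()
print(summary)
# "cs.CV cs.LG" is treated as ONE distinct value here, not as two categories
```

This is why unique is so much larger than the number of categories arXiv actually defines.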

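Many papers belong to more than one category, so the summary above is not enough. Next, count how many distinct individual categories appear in the dataset:

# All distinct individual categories

unique_categories = {cat for cats in data["categories"] for cat in cats.split(' ')}
print(len(unique_categories))
print(unique_categories)

Figure: full list of the distinct categories

There are 176 distinct categories, more than the official site lists, which suggests that some categories in the data are not on the official list. The computer science categories are unchanged, though: there are still the 40 listed officially.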
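Beyond the number of distinct categories, it is often useful to know how often each one occurs and how many belong to computer science. A minimal sketch on made-up category strings, using collections.Counter instead of a set:

```python
from collections import Counter

# Toy stand-in for data["categories"]
category_strings = ["astro-ph", "cs.CV cs.LG", "cs.AI cs.LG", "solv-int nlin.SI", "cs.CV"]

# Count every individual category, splitting cross-listed strings on whitespace
counts = Counter(cat for cats in category_strings for cat in cats.split())

unique_categories = set(counts)
cs_categories = sorted(c for c in unique_categories if c.startswith("cs."))

print(len(unique_categories))   # number of distinct individual categories
print(cs_categories)            # distinct computer science categories
print(counts.most_common(3))    # most frequent individual categories
```

On the real data, len(cs_categories) is the number to compare against the 40 computer science categories mentioned above.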

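This task is about frontier research, so only papers from the last two years are analyzed. The year is taken from update_date, so the selection also includes older papers that were last updated in 2019 or later: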

# Data from 2019 onward
data["year"] = pd.to_datetime(data["update_date"]).dt.year  # parse update_date (a str such as 2019-02-20) as datetime and extract the year
del data["update_date"]  # update_date has served its purpose
data = data[data["year"] >= 2019].reset_index(drop=True)  # keep rows whose year is 2019 or later and renumber them
# The original also called data.groupby(['categories', 'year']) here without using the result.
# groupby only builds a GroupBy object and neither sorts nor modifies data, so the call is dropped.

# Inspect the result
print("Data from 2019 onward:")
data

Data from 2019 onward:

id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	authors_parsed	year
0	0704.0297	Sung-Chul Yoon	Sung-Chul Yoon, Philipp Podsiadlowski and Step...	Remnant evolution after a carbon-oxygen white ...	15 pages, 15 figures, 3 tables, submitted to M...	None	10.1111/j.1365-2966.2007.12161.x	None	astro-ph	None	We systematically explore the evolution of t...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	[[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...	2019
1	0704.0342	Patrice Ntumba Pungu	B. Dugmore and PP. Ntumba	Cofibrations in the Category of Frolicher Spac...	27 pages	None	None	None	math.AT	None	Cofibrations are defined in the category of ...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	[[Dugmore, B., ], [Ntumba, PP., ]]	2019
2	0704.0360	Zaqarashvili	T.V. Zaqarashvili and K Murawski	Torsional oscillations of longitudinally inhom...	6 pages, 3 figures, accepted in A&A	None	10.1051/0004-6361:20077246	None	astro-ph	None	We explore the effect of an inhomogeneous ma...	[{'version': 'v1', 'created': 'Tue, 3 Apr 2007...	[[Zaqarashvili, T. V., ], [Murawski, K, ]]	2019
3	0704.0525	Sezgin Ayg\"un	Sezgin Aygun, Ismail Tarhan, Husnu Baysal	On the Energy-Momentum Problem in Static Einst...	This submission has been withdrawn by arXiv ad...	Chin.Phys.Lett.24:355-358,2007	10.1088/0256-307X/24/2/015	None	gr-qc	None	This paper has been removed by arXiv adminis...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	[[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...	2019
4	0704.0535	Antonio Pipino	Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...	The Formation of Globular Cluster Systems in M...	32 pages (referee format), 9 figures, ApJ acce...	Astrophys.J.665:295-305,2007	10.1086/519546	None	astro-ph	None	The most massive elliptical galaxies show a ...	[{'version': 'v1', 'created': 'Wed, 4 Apr 2007...	[[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...	2019
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
395118	quant-ph/9911051	Stephen A. Fulling	S. A. Fulling	Large Numbers, the Chinese Remainder Theorem, ...	9 pages; Plain TeX with vanilla.sty and pictex...	Phys. Rev. Applied 13, 024016 (2020)	10.1103/PhysRevApplied.13.024016	None	quant-ph	None	This is a pedagogical article cited in the f...	[{'version': 'v1', 'created': 'Thu, 11 Nov 199...	[[Fulling, S. A., ]]	2020
395119	solv-int/9511005	Wen-Xiu Ma	Wen-Xiu Ma, Benno Fuchssteiner	Explicit and Exact Solutions to a Kolmogorov-P...	14pages, Latex, to appear in Intern. J. Nonlin...	None	10.1016/0020-7462(95)00064-X	None	solv-int nlin.SI	None	Some explicit traveling wave solutions to a ...	[{'version': 'v1', 'created': 'Tue, 14 Nov 199...	[[Ma, Wen-Xiu, ], [Fuchssteiner, Benno, ]]	2019
395120	solv-int/9809008	Victor Enolskii	J C Eilbeck, V Z Enol'skii, V B Kuznetsov, D V...	Linear r-Matrix Algebra for a Hierarchy of One...	plain LaTeX, 28 pages	None	None	None	solv-int nlin.SI	None	We consider a hierarchy of many-particle sys...	[{'version': 'v1', 'created': 'Wed, 2 Sep 1998...	[[Eilbeck, J C, ], [Enol'skii, V Z, ], [Kuznet...	2019
395121	solv-int/9909010	Pierre van Moerbeke	M. Adler, T. Shiota and P. van Moerbeke	Pfaff tau-functions	42 pages	None	None	None	solv-int adap-org hep-th nlin.AO nlin.SI	None	Consider the evolution $$ \frac{\pl m_\iy}{\...	[{'version': 'v1', 'created': 'Wed, 15 Sep 199...	[[Adler, M., ], [Shiota, T., ], [van Moerbeke,...	2019
395122	solv-int/9909014	David Fairlie	D.B. Fairlie and A.N. Leznov	The General Solution of the Complex Monge-Amp\...	13 pages, latex, no figures	None	10.1088/0305-4470/33/25/307	None	solv-int nlin.SI	None	A general solution to the Complex Monge-Amp\...	[{'version': 'v1', 'created': 'Thu, 16 Sep 199...	[[Fairlie, D. B., ], [Leznov, A. N., ]]	2019

395123 rows × 14 columns
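The ids in the table above start with 0704 and the first versions were created in 2007, yet the year is 2019: update_date records the last update, not the first submission. A minimal sketch of deriving the submission year from the versions field instead. It assumes each created string is an RFC 2822 style date such as 'Tue, 3 Apr 2007 19:18:42 GMT' (the table truncates the value, so the time part here is made up), and first_version_year is a hypothetical helper.

```python
from email.utils import parsedate_to_datetime
import pandas as pd

def first_version_year(versions):
    """Return the year in which the first version (v1) was created."""
    return parsedate_to_datetime(versions[0]["created"]).year

# Toy rows; the update dates, the second id and the times of day are made up
toy = pd.DataFrame({
    "id": ["0704.0297", "2001.00001"],
    "update_date": ["2019-08-19", "2020-01-03"],
    "versions": [
        [{"version": "v1", "created": "Tue, 3 Apr 2007 19:18:42 GMT"}],
        [{"version": "v1", "created": "Wed, 1 Jan 2020 10:00:00 GMT"}],
    ],
})
toy["year"] = pd.to_datetime(toy["update_date"]).dt.year          # year of last update
toy["submitted_year"] = toy["versions"].apply(first_version_year)  # year of first submission

recent = toy[toy["submitted_year"] >= 2019].reset_index(drop=True)
print(toy[["id", "year", "submitted_year"]])
print(recent["id"].tolist())
```

Filtering on submitted_year keeps only papers that first appeared in the period, which may be closer to what "recent research" means; filtering on year, as the post does, also keeps revised older papers.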

# Data from 2020 onward
data20 = data[data["year"] >= 2020].reset_index(drop=True)  # keep rows whose year is 2020 or later and renumber them
# As above, the groupby call whose result was never used is dropped.
print("Data from 2020 onward:")
data20

Data from 2020 onward:

id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	authors_parsed	year
0	0704.0752	Davoud Kamani	Davoud Kamani	Actions for the Bosonic String with the Curved...	8 pages, Latex, no figure, Some minor changes ...	Braz. J. Phys. 38, 268-271 (2008)	10.1590/S0103-97332008000200010	None	hep-th	None	At first we introduce an action for the stri...	[{'version': 'v1', 'created': 'Thu, 5 Apr 2007...	[[Kamani, Davoud, ]]	2020
1	0704.0880	Qiuping A. Wang	Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ...	Stochastic action principle and maximum entropy	This work is a further development of the idea...	Chaos, Solitons and Fractals, 40(2009)2550-2556	None	None	cond-mat.stat-mech	None	A stochastic action principle for stochastic...	[{'version': 'v1', 'created': 'Fri, 6 Apr 2007...	[[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM...	2020
2	0704.1403	Alberto S. Cattaneo	Alberto S. Cattaneo, Florian Schaetz	Equivalences of Higher Derived Brackets	16 pages; minor changes; corrected typos; to a...	J. Pure Appl. Algebra, 212, 2450-2460 (2008)	10.1016/j.jpaa.2008.03.013	None	math.QA math.DG math.SG	None	This note elaborates on Th. Voronov's constr...	[{'version': 'v1', 'created': 'Wed, 11 Apr 200...	[[Cattaneo, Alberto S., ], [Schaetz, Florian, ]]	2020
3	0704.2498	Daniel H. Lenz	Daniel Lenz, Nicolae Strungaru	Pure Point spectrum for measure dynamical syst...	22 pages	Journal de Math\'ematiques Pures et Appliqu\'e...	10.1016/j.matpur.2009.05.013	None	math-ph math.MP	http://arxiv.org/licenses/nonexclusive-distrib...	We show equivalence of pure point diffractio...	[{'version': 'v1', 'created': 'Thu, 19 Apr 200...	[[Lenz, Daniel, ], [Strungaru, Nicolae, ]]	2020
4	0704.2967	Serhiy Samokhvalov E.	Serhiy E. Samokhvalov	Group-theoretic Description of Riemannian Spaces	14 pages	Ukrainian Math. J., v.55 (2003), 1238-1248	10.1023/B:UKMA.0000018010.14309.76	None	math.DG math.GR	None	It is shown that a locally geometrical struc...	[{'version': 'v1', 'created': 'Mon, 23 Apr 200...	[[Samokhvalov, Serhiy E., ]]	2020
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
224500	quant-ph/9802022	Masanao Ozawa	Masanao Ozawa (Nagoya University)	On the Concept of Quantum State Reduction: Inc...	15 pages, LaTeX, 1 Postscript figure	Annals of the Japan Association for Philosophy...	10.4288/jafpos1956.11.107	None	quant-ph	None	The argument is re-examined that the program...	[{'version': 'v1', 'created': 'Mon, 9 Feb 1998...	[[Ozawa, Masanao, , Nagoya University]]	2020
224501	quant-ph/9806088	Jens Eisert	J. Eisert, M. Wilkens, and M. Lewenstein	Quantum Games and Quantum Strategies	4 pages, 4 figures, typographic sign error in ...	Phys. Rev. Lett. 83, 3077 (1999)	10.1103/PhysRevLett.83.3077	None	quant-ph	http://arxiv.org/licenses/nonexclusive-distrib...	We investigate the quantization of non-zero ...	[{'version': 'v1', 'created': 'Fri, 26 Jun 199...	[[Eisert, J., ], [Wilkens, M., ], [Lewenstein,...	2020
224502	quant-ph/9807034	Jens Eisert	J. Eisert (U. Potsdam, Germany), M. B. Plenio ...	A comparison of entanglement measures	6 pages (RevTeX), 4 figures	J. Mod. Opt. 46, 145 (1999)	10.1080/09500349908231260	J. Mod. Opt. 46, 145-154 (1999)	quant-ph	None	We compare the entanglement of formation wit...	[{'version': 'v1', 'created': 'Mon, 13 Jul 199...	[[Eisert, J., , U. Potsdam, Germany], [Plenio,...	2020
224503	quant-ph/9910035	Pavel Exner	P.Duclos, P.Exner, and D. Krejcirik	Locally curved quantum layers	LaTeX2e, 15 pages, to appear in the Ukrainian ...	Ukrainian J. Phys. 45 (2000), 595-601	None	None	quant-ph cond-mat math-ph math.MP	None	We consider a quantum particle constrained t...	[{'version': 'v1', 'created': 'Fri, 8 Oct 1999...	[[Duclos, P., ], [Exner, P., ], [Krejcirik, D....	2020
224504	quant-ph/9911051	Stephen A. Fulling	S. A. Fulling	Large Numbers, the Chinese Remainder Theorem, ...	9 pages; Plain TeX with vanilla.sty and pictex...	Phys. Rev. Applied 13, 024016 (2020)	10.1103/PhysRevApplied.13.024016	None	quant-ph	None	This is a pedagogical article cited in the f...	[{'version': 'v1', 'created': 'Thu, 11 Nov 199...	[[Fulling, S. A., ]]	2020

224505 rows × 14 columns

With all papers from 2019 onward and from 2020 onward in hand, the next step is to pick out the computer science papers from each (source: arXiv Category Taxonomy):

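# Pick out the computer science categories
import random  # the standard-library random.choice is enough for picking a User-Agent
# Scrape all categories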
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]
headers = {'User-Agent': random.choice(user_agent_list)}  # send a browser-like User-Agent
# The original also sent GitHub-specific Accept and Content-Type headers and passed verify=False.
# Neither is needed for this page, and verify=False turns off TLS certificate checks.
website_url = requests.get('https://arxiv.org/category_taxonomy', headers=headers, timeout=30).text  # HTML text of the page
soup = BeautifulSoup(website_url, 'html.parser')  # parse the HTML
root = soup.find('div', {'id': 'category_taxonomy_list'})  # container that holds the taxonomy
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # headings and descriptions, in document order

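# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_3_name = ""  # initialized so that a <p> seen before any <h4> cannot raise NameError
level_3_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk the tags in document order: h2 = group, h3 = archive, h4 = category, p = category description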
for t in tags:
    if t.name == "h2":
        level_1_name = t.text    
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # re.sub(pattern, replacement, string): keep only group 2, the text inside the parentheses
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the lists collected above
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
    
})

# The original called df_taxonomy.groupby(["group_name","archive_name"]) here; without an aggregation the call neither groups nor sorts the frame, so it is dropped.
df_taxonomy

	group_name	archive_name	archive_id	category_name	categories	category_description
0	Computer Science	Computer Science	Computer Science	Artificial Intelligence	cs.AI	Covers all areas of AI except Vision, Robotics...
1	Computer Science	Computer Science	Computer Science	Hardware Architecture	cs.AR	Covers systems organization and hardware archi...
2	Computer Science	Computer Science	Computer Science	Computational Complexity	cs.CC	Covers models of computation, complexity class...
3	Computer Science	Computer Science	Computer Science	Computational Engineering, Finance, and Science	cs.CE	Covers applications of computer science to the...
4	Computer Science	Computer Science	Computer Science	Computational Geometry	cs.CG	Roughly includes material in ACM Subject Class...
...	...	...	...	...	...	...
150	Statistics	Statistics	Statistics	Computation	stat.CO	Algorithms, Simulation, Visualization
151	Statistics	Statistics	Statistics	Methodology	stat.ME	Design, Surveys, Model Selection, Multiple Tes...
152	Statistics	Statistics	Statistics	Machine Learning	stat.ML	Covers machine learning papers (supervised, un...
153	Statistics	Statistics	Statistics	Other Statistics	stat.OT	Work in statistics that does not fit into the ...
154	Statistics	Statistics	Statistics	Statistics Theory	stat.TH	stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns
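How the two re.sub calls split an h4 heading can be checked on a single string. A small sketch, assuming the heading text looks like 'cs.AI (Artificial Intelligence)', which is consistent with the categories and category_name columns above. split_heading is a hypothetical helper that does the same split with one re.fullmatch.

```python
import re

raw = "cs.AI (Artificial Intelligence)"
pattern = r"(.*) \((.*)\)"

# The approach used in the loop: replace the whole match by one captured group
code = re.sub(pattern, r"\1", raw)
name = re.sub(pattern, r"\2", raw)

# When the pattern does not match, re.sub returns the input unchanged
unchanged = re.sub(pattern, r"\1", "Computer Science")

def split_heading(text):
    """Return (code, name), or None when the text has no 'code (name)' shape."""
    m = re.fullmatch(pattern, text)
    return (m.group(1), m.group(2)) if m else None

print(code, "|", name)
print(unchanged)
print(split_heading(raw))
print(split_heading("Computer Science"))
```

Returning None for a non-matching heading makes a changed page layout visible, whereas re.sub would silently pass the raw text through.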

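[Note] Problems I ran into and how I solved them

Problem 1: the code originally looked like this
soup = BeautifulSoup(website_url,'lxml') # parse the page with the lxml parser, which is faster
It raised an error.

My fix was to change it to
soup = BeautifulSoup(website_url,'html.parser') # parse the page
The 'lxml' parser only works when the separate lxml package is installed; without it, bs4 raises FeatureNotFound. Python's built-in 'html.parser' needs no extra install, and installing lxml would also resolve the error. The article I consulted: bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?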
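Instead of hard-coding one parser, the choice can be made at run time. A minimal sketch, assuming bs4 is installed and lxml may or may not be; make_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when it is installed, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed: fall back to Python's own html.parser
        return BeautifulSoup(html, "html.parser")

soup_demo = make_soup("<div id='demo'><h4>cs.AI (Artificial Intelligence)</h4></div>")
print(soup_demo.find("h4").get_text())
```

Either parser gives the same result on a simple page like the taxonomy list; lxml is mainly a speed optimization.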

Problem 2: the following error appeared
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='baike.baidu.com', port=443):
 Max retries exceeded with url: https://baike.baidu.com/item/%E5%88%98%E5%BE%B7%E5%8D%8E/114923
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb51433af98>:
 Failed to establish a new connection: [Errno -2] Name or service not known',))
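This error comes up often. In my experience it can be caused by an unstable network or by the IP being blocked after too many requests, both of which are common when scraping. (The "[Errno -2] Name or service not known" part of this particular trace means the host name could not be resolved, which points to a DNS or network problem.) My fix was to add the following "disguise" headers: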
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]

headers = {'User-Agent': random.choice(user_agent_list)}  # pick a browser-like User-Agent at random
See these articles for details:

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
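When the cause is a flaky network rather than a blocked client, retrying after a pause usually helps more than changing headers. A minimal sketch of retrying with exponential backoff, using only the standard library; with_retries is a hypothetical helper and flaky_fetch is a stand-in for the real request, so no network access happens here.

```python
import time

def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return func()
        except exceptions:
            if n == attempts - 1:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** n)

# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "<html>ok</html>"

html = with_retries(flaky_fetch, attempts=5, base_delay=0.01)
print(html, calls["count"])
```

With requests, the call could look like with_retries(lambda: requests.get(url, headers=headers, timeout=30), exceptions=(requests.exceptions.ConnectionError,)), where url is the page to fetch.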

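Ⅲ. Data Analysis and Visualization

First, look at how the papers are distributed across the top-level groups:

# Paper counts for all top-level groups (2019 onward)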
_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df

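The two tables are merged on their shared "categories" column, duplicates per (id, group_name) are dropped, and the papers are counted per "group_name"; the count is stored in the "id" column and sorted in descending order. Because the merge compares the whole categories string with a single category code, only papers that have exactly one category find a match. Papers with several categories get no group_name and are left out of the count by groupby, which is why the totals below are well under 395,123. The result: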

	group_name	id
0	Physics	79985
1	Mathematics	51567
2	Computer Science	40067
3	Statistics	4054
4	Electrical Engineering and Systems Science	3297
5	Quantitative Biology	1994
6	Quantitative Finance	826
7	Economics	576

Likewise, the data from 2020 onward:

# Paper counts for all top-level groups (2020 onward)
_df20 = data[data['year']>=2020].merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df20

	group_name	id
0	Physics	41606
1	Mathematics	27072
2	Computer Science	21980
3	Statistics	2252
4	Electrical Engineering and Systems Science	1926
5	Quantitative Biology	1108
6	Quantitative Finance	474
7	Economics	403

Visualize with a pie chart:

# Pie chart (2019 onward)
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1) 
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

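Figure: share of papers by discipline, 2019 onward

The pie chart for 2020 onward can be drawn in the same way:

Figure: share of papers by discipline, 2020 onward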
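The code for the 2020 chart is not shown in the post. A minimal sketch of a reusable version, assuming a frame with the group_name and id columns used above; plot_group_pie is a hypothetical helper, and unlike the hard-coded 8-element explode tuple it derives the offsets from the data, so it also works when the number of groups changes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_group_pie(df, small_share=0.05):
    """Draw a pie chart of df['id'] labelled by df['group_name']; offset small wedges."""
    share = df["id"] / df["id"].sum()
    # Pull wedges below small_share outward so that their labels do not overlap
    explode = [0.2 if s < small_share else 0 for s in share]
    fig, ax = plt.subplots(figsize=(15, 12))
    ax.pie(df["id"], labels=df["group_name"], autopct="%1.2f%%",
           startangle=160, explode=explode)
    fig.tight_layout()
    return fig, ax

# First four rows of the 2020 table above, used here as sample input
_df20_sample = pd.DataFrame({
    "group_name": ["Physics", "Mathematics", "Computer Science", "Statistics"],
    "id": [41606, 27072, 21980, 2252],
})
fig, ax = plot_group_pie(_df20_sample)
```

In a notebook, drop the matplotlib.use("Agg") line and call plt.show(), or save the chart with fig.savefig.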

Next, count the papers in each computer science subfield for 2019 and 2020:

# Paper counts per computer science subfield in 2019 and 2020
group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")

category_name	2019	2020
Artificial Intelligence	558	757
Computation and Language	2153	2906
Computational Complexity	131	188
Computational Engineering, Finance, and Science	108	205
Computational Geometry	199	216
Computer Science and Game Theory	281	323
Computer Vision and Pattern Recognition	5559	6517
Computers and Society	346	564
Cryptography and Security	1067	1238
Data Structures and Algorithms	711	902
Databases	282	342
Digital Libraries	125	157
Discrete Mathematics	84	81
Distributed, Parallel, and Cluster Computing	715	774
Emerging Technologies	101	84
Formal Languages and Automata Theory	152	137
General Literature	5	5
Graphics	116	151
Hardware Architecture	95	159
Human-Computer Interaction	420	580
Information Retrieval	245	331
Logic in Computer Science	470	504
Machine Learning	177	538
Mathematical Software	27	45
Multiagent Systems	85	90
Multimedia	76	66
Networking and Internet Architecture	864	783
Neural and Evolutionary Computing	235	279
Numerical Analysis	40	11
Operating Systems	36	33
Other Computer Science	67	69
Performance	45	51
Programming Languages	268	294
Robotics	917	1298
Social and Information Networks	202	325
Software Engineering	659	804
Sound	7	4
Symbolic Computation	44	36
Systems and Control	415	133

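In 2020 most categories have more papers than in 2019. In both years the most productive category is Computer Vision and Pattern Recognition, by a wide margin over every other category, so CV and PR still appear to be the mainstream direction of current research.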
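The year-over-year comparison can be made explicit by computing the relative change per category. A minimal sketch using six rows copied from the table above; on the full pivot table the same two lines would apply to every category.

```python
import pandas as pd

# A few rows copied from the table above (papers per category and year)
counts = pd.DataFrame(
    {2019: [558, 2153, 5559, 177, 864, 415],
     2020: [757, 2906, 6517, 538, 783, 133]},
    index=["Artificial Intelligence", "Computation and Language",
           "Computer Vision and Pattern Recognition", "Machine Learning",
           "Networking and Internet Architecture", "Systems and Control"],
)

counts["growth"] = (counts[2020] - counts[2019]) / counts[2019]  # relative change
counts["share_2020"] = counts[2020] / counts[2020].sum()          # share within these rows only

print(counts.sort_values("growth", ascending=False).round(3))
print((counts["growth"] > 0).sum(), "of", len(counts), "sampled categories grew")
```

Sorting by growth rather than by absolute count separates the largest field from the fastest-growing one.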

Task 2 link: Introduction to Data Analysis (Academic Frontier Trend Analysis) Task 2: Paper Author Statistics

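Ⅳ. References

Datawhale Data Analysis Training Camp Study Handbook (Academic Frontier Trend Analysis), Task 1: Paper Statistics

A Detailed Guide to the Basic Attributes of Pandas DataFrame

Getting Started with Pandas (2): DataFrame Structure and Common Operations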

bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
