Understanding the JSON data
Importing libraries
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
Loading the data
data = []
# The file holds one JSON object per line, so parse it line by line
with open('./arxiv-metadata-oai-2019.json','r') as f:
    for line in f:
        data.append(json.loads(line))
data[0]
{'id': '0704.0297',
'submitter': 'Sung-Chul Yoon',
'authors': 'Sung-Chul Yoon, Philipp Podsiadlowski and Stephan Rosswog',
'title': 'Remnant evolution after a carbon-oxygen white dwarf merger',
'comments': '15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n version; a high resolution version can be found at:\n http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)',
'journal-ref': None,
'doi': '10.1111/j.1365-2966.2007.12161.x',
'report-no': None,
'categories': 'astro-ph',
'license': None,
'abstract': ' We systematically explore the evolution of the merger of two carbon-oxygen\n(CO) white dwarfs. The dynamical evolution of a 0.9 Msun + 0.6 Msun CO white\ndwarf merger is followed by a three-dimensional SPH simulation. We use an\nelaborate prescription in which artificial viscosity is essentially absent,\nunless a shock is detected, and a much larger number of SPH particles than\nearlier calculations. Based on this simulation, we suggest that the central\nregion of the merger remnant can, once it has reached quasi-static equilibrium,\nbe approximated as a differentially rotating CO star, which consists of a\nslowly rotating cold core and a rapidly rotating hot envelope surrounded by a\ncentrifugally supported disc. We construct a model of the CO remnant that\nmimics the results of the SPH simulation using a one-dimensional hydrodynamic\nstellar evolution code and then follow its secular evolution. The stellar\nevolution models indicate that the growth of the cold core is controlled by\nneutrino cooling at the interface between the core and the hot envelope, and\nthat carbon ignition in the envelope can be avoided despite high effective\naccretion rates. This result suggests that the assumption of forced accretion\nof cold matter that was adopted in previous studies of the evolution of double\nCO white dwarf merger remnants may not be appropriate. Our results imply that\nat least some products of double CO white dwarfs merger may be considered good\ncandidates for the progenitors of Type Ia supernovae. In this case, the\ncharacteristic time delay between the initial dynamical merger and the eventual\nexplosion would be ~10^5 yr. (Abridged).\n',
'versions': [{'version': 'v1', 'created': 'Tue, 3 Apr 2007 01:50:26 GMT'},
{'version': 'v2', 'created': 'Wed, 4 Apr 2007 17:28:44 GMT'}],
'update_date': '2019-08-19',
'authors_parsed': [['Yoon', 'Sung-Chul', ''],
['Podsiadlowski', 'Philipp', ''],
['Rosswog', 'Stephan', '']]}
data = pd.DataFrame(data)
data.shape
(170618, 14)
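The line-by-line loading above keeps all 14 fields of every record in memory. A minimal sketch of the same idea that keeps only selected fields is shown below; the helper `read_json_lines`, its `keep` parameter, and the two sample records are assumptions made for illustration, not part of the original notebook.

```python
import io
import json

import pandas as pd


def read_json_lines(handle, keep=None):
    """Parse one JSON object per line; optionally keep only the fields in `keep`."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep is not None:
            record = {key: record.get(key) for key in keep}
        rows.append(record)
    return pd.DataFrame(rows)


# Illustrative stand-in for the real file (hypothetical, shortened records)
sample = io.StringIO(
    '{"id": "0704.0297", "categories": "astro-ph", "update_date": "2019-08-19"}\n'
    '{"id": "0704.0342", "categories": "math.AT", "update_date": "2019-08-19"}\n'
)
df = read_json_lines(sample, keep=["id", "categories"])
print(df.shape)
```

Because each `id` is parsed by `json.loads` as a string, identifiers such as `0704.0297` keep their leading zero.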
data.head()
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... |
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Dugmore, B., ], [Ntumba, PP., ]] |
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Zaqarashvili, T. V., ], [Murawski, K, ]] |
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-10-21 | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... |
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-08-19 | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... |
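Data preprocessing
count: the number of non-null values in the column
unique: the number of distinct values in the column
top: the most frequent value in the column
freq: the number of times the most frequent value occurs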
#data.describe()
count 1778381
unique 61371
top astro-ph
freq 86914
Name: categories, dtype: object
#data['categories'].describe()
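To make the four fields concrete, here is a small sketch of `describe()` on a text column. The four-element Series is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Illustrative values only: "astro-ph" appears twice, the others once
sample = pd.Series(["astro-ph", "math.AT", "astro-ph", "gr-qc"])

summary = sample.describe()
print(summary)
```

Here `count` is 4, `unique` is 3, `top` is `astro-ph`, and `freq` is 2.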
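A paper can list several categories separated by spaces, so counting the raw `categories` strings misses most individual categories. To count them properly, split each string on spaces and collect the distinct codes.
unique_categories = {cat for cats in data['categories'] for cat in cats.split(' ')} # nested comprehension, read left to right: outer loop over rows, inner loop over the codes in each row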
len(unique_categories)
172
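The same splitting step can also be written with pandas string methods, which additionally gives a per-category count. This is a sketch on a toy Series standing in for `data['categories']`; the four values are illustrative.

```python
import pandas as pd

# Toy stand-in for data['categories']; the values are illustrative
categories = pd.Series(["astro-ph", "solv-int nlin.SI", "cs.CV cs.LG", "cs.LG"])

# Split each string on whitespace, then give every code its own row
per_category = categories.str.split().explode()

unique_categories = set(per_category)
category_counts = per_category.value_counts()

print(len(unique_categories))
print(category_counts)
```

`Series.explode` requires pandas 0.25 or later.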
Categories of computer science papers
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
We analyze papers from 2019 onward, so we first preprocess the time feature.
data['year'] = pd.to_datetime(data['update_date']).dt.year # parse the update_date column as datetime, then extract the year
data.tail(1)
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
170617 | solv-int/9909014 | David Fairlie | D.B. Fairlie and A.N. Leznov | The General Solution of the Complex Monge-Amp\... | 13 pages, latex, no figures | None | 10.1088/0305-4470/33/25/307 | None | solv-int nlin.SI | None | A general solution to the Complex Monge-Amp\... | [{'version': 'v1', 'created': 'Thu, 16 Sep 199... | 2019-08-21 | [[Fairlie, D. B., ], [Leznov, A. N., ]] | 2019 |
# Drop the update_date column
del data['update_date']
# Keep only papers whose year is 2019 or later
data = data[data['year'] >= 2019]
data.head(1)
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019 |
# Reset the index so rows are numbered from 0 again
data.reset_index(drop=True, inplace=True)
data.head(1)
index | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019 |
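Note that `update_date` records when an entry was last updated, not when the paper was first submitted. That is why the first row above has a 2007 identifier (0704.0297) but a `year` of 2019. If the first-submission year is wanted instead, one option is to read it from the `versions` field. The sketch below assumes the `created` strings follow the format seen in the first record and parses them with the standard library; the column names `update_year` and `first_year` are made up for illustration.

```python
from email.utils import parsedate_to_datetime

import pandas as pd

# One record copied from the first entry shown earlier (other fields omitted)
records = pd.DataFrame({
    "id": ["0704.0297"],
    "update_date": ["2019-08-19"],
    "versions": [[
        {"version": "v1", "created": "Tue, 3 Apr 2007 01:50:26 GMT"},
        {"version": "v2", "created": "Wed, 4 Apr 2007 17:28:44 GMT"},
    ]],
})

# Year of the last update, as computed in the notebook
records["update_year"] = pd.to_datetime(records["update_date"]).dt.year

# Year in which version v1 was created (hypothetical extra column)
records["first_year"] = records["versions"].apply(
    lambda versions: parsedate_to_datetime(versions[0]["created"]).year
)

print(records[["id", "update_year", "first_year"]])
```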
Finding computer science papers among the 2019 papers
website_url = requests.get('https://arxiv.org/category_taxonomy').text # fetch the page as text
website_url[0:100]
'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN'
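soup = BeautifulSoup(website_url,'lxml') # parse the raw text into a navigable HTML tree
root = soup.find('div',{'id':'category_taxonomy_list'}) # locate the div that holds the category list
tags = root.find_all(['h2','h3','h4','p'], recursive =True) # collect the heading and paragraph tags in document order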
# Initialize the string and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_3_name = ""
level_3_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
# Walk the tags in order, carrying the current group, archive, and category forward
for t in tags:
    if t.name == "h2":
        # Group heading; also used as the archive until an h3 overrides it
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        # re.sub(pattern, replacement, string): \2 keeps the text inside the parentheses
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw)
        # \1 keeps the text before the parentheses
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        # Each description paragraph closes one category record
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
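The regular expressions in the loop split a heading of the form `code (name)` into its two parts. A minimal sketch of that step in isolation follows; the helper `split_heading` is hypothetical, and the sample string is an assumption about what the page's `h4` text looks like.

```python
import re


def split_heading(raw):
    """Split 'code (name)' into (code, name); return (raw, '') if the pattern is absent."""
    match = re.fullmatch(r"(.*) \((.*)\)", raw.strip())
    if match is None:
        return raw.strip(), ""
    return match.group(1), match.group(2)


# Assumed sample heading text
code, name = split_heading("cs.AI (Artificial Intelligence)")
print(code, "|", name)

# The same split written with re.sub and backreferences, as in the loop
same_code = re.sub(r"(.*) \((.*)\)", r"\1", "cs.AI (Artificial Intelligence)")
print(same_code)
```

Using `re.fullmatch` with groups makes the no-match case explicit, whereas `re.sub` silently returns the input unchanged when the pattern does not match.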
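# Build a DataFrame from the collected lists
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
})
# Group by "group_name" and "archive_name". Note: groupby returns a GroupBy object and does not sort
# or modify df_taxonomy; because the result is not assigned, this line has no effect on the output below
df_taxonomy.groupby(["group_name","archive_name"])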
df_taxonomy.head()
index | group_name | archive_name | archive_id | category_name | categories | category_description |
---|---|---|---|---|---|---|
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics... |
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi... |
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class... |
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the... |
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class... |
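The unassigned `groupby` call above leaves `df_taxonomy` in its original order. If the intent is to order rows by group and then by archive, `sort_values` is one way to do that. The sketch below uses a three-row toy frame; the rows are illustrative and not scraped from the page.

```python
import pandas as pd

# Toy taxonomy with illustrative rows
taxonomy = pd.DataFrame({
    "group_name": ["Physics", "Computer Science", "Physics"],
    "archive_name": ["Condensed Matter", "Computer Science", "Astrophysics"],
    "categories": ["cond-mat.soft", "cs.AI", "astro-ph.GA"],
})

# groupby alone only builds a GroupBy object; the frame itself is untouched
grouped = taxonomy.groupby(["group_name", "archive_name"])
sizes = grouped.size()

# sort_values returns a new frame ordered by group, then by archive
sorted_taxonomy = taxonomy.sort_values(["group_name", "archive_name"]).reset_index(drop=True)

print(sizes)
print(sorted_taxonomy)
```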
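Data analysis and visualization
# Join the papers to the taxonomy and count distinct papers per group.
# The join key is the full categories string, so only papers whose string is exactly one taxonomy code are matched.
_df = (
    data.merge(df_taxonomy, on="categories", how="left")
    .drop_duplicates(["id","group_name"])
    .groupby("group_name")
    .agg({"id":"count"})
    .sort_values(by="id",ascending=False)
    .reset_index()
)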
_df
index | group_name | id |
---|---|---|
0 | Physics | 38379 |
1 | Mathematics | 24495 |
2 | Computer Science | 18087 |
3 | Statistics | 1802 |
4 | Electrical Engineering and Systems Science | 1371 |
5 | Quantitative Biology | 886 |
6 | Quantitative Finance | 352 |
7 | Economics | 173 |
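Because the merge joins on the whole `categories` string, a paper listed as `solv-int nlin.SI` matches no taxonomy row and drops out of the counts above. One possible way to include cross-listed papers is to split the string into one row per code before merging. The sketch below shows this on toy data; the three papers and the four-row taxonomy are made up for illustration.

```python
import pandas as pd

# Toy papers and taxonomy (illustrative values)
papers = pd.DataFrame({
    "id": ["a", "b", "c"],
    "categories": ["cs.CV", "cs.CV cs.LG", "math.AT stat.ML"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.CV", "cs.LG", "math.AT", "stat.ML"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics", "Statistics"],
})

# Direct merge: only paper "a" has a string equal to a single taxonomy code
direct = papers.merge(taxonomy, on="categories", how="left")
direct_counts = direct.groupby("group_name")["id"].count()

# Split first, so each (paper, code) pair becomes its own row, then merge
exploded = papers.assign(categories=papers["categories"].str.split()).explode("categories")
merged = exploded.merge(taxonomy, on="categories", how="left")
group_counts = (
    merged.drop_duplicates(["id", "group_name"])  # count a paper once per group
    .groupby("group_name")["id"]
    .count()
)

print(direct_counts)
print(group_counts)
```

With this approach a paper cross-listed in two groups is counted once in each, so the per-group totals can sum to more than the number of papers.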
fig = plt.figure(figsize=(15,12))
explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
plt.pie(_df['id'], labels=_df['group_name'], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()
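The hard-coded `explode` tuple above must have exactly as many entries as there are groups. A sketch that derives it from the data instead is shown below, using the group counts from the table above. The 5% threshold and the 0.2 offset are arbitrary choices made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

group_names = [
    "Physics", "Mathematics", "Computer Science", "Statistics",
    "Electrical Engineering and Systems Science", "Quantitative Biology",
    "Quantitative Finance", "Economics",
]
counts = [38379, 24495, 18087, 1802, 1371, 886, 352, 173]

# Pull out every slice whose share is below the (assumed) 5% threshold
total = sum(counts)
threshold = 0.05
explode = [0.2 if count / total < threshold else 0 for count in counts]

fig, ax = plt.subplots(figsize=(15, 12))
ax.pie(counts, labels=group_names, autopct="%1.2f%%", startangle=160, explode=explode)
fig.tight_layout()
plt.close(fig)  # in a notebook, call plt.show() instead
```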
group_name = 'Computer Science'
cats = data.merge(df_taxonomy, on='categories').query('group_name == @group_name')
cats.groupby(['year','category_name']).count().reset_index().pivot(index='category_name', columns = 'year', values = 'id')
category_name | 2019 (year) |
---|---|
Artificial Intelligence | 558 |
Computation and Language | 2153 |
Computational Complexity | 131 |
Computational Engineering, Finance, and Science | 108 |
Computational Geometry | 199 |
Computer Science and Game Theory | 281 |
Computer Vision and Pattern Recognition | 5559 |
Computers and Society | 346 |
Cryptography and Security | 1067 |
Data Structures and Algorithms | 711 |
Databases | 282 |
Digital Libraries | 125 |
Discrete Mathematics | 84 |
Distributed, Parallel, and Cluster Computing | 715 |
Emerging Technologies | 101 |
Formal Languages and Automata Theory | 152 |
General Literature | 5 |
Graphics | 116 |
Hardware Architecture | 95 |
Human-Computer Interaction | 420 |
Information Retrieval | 245 |
Logic in Computer Science | 470 |
Machine Learning | 177 |
Mathematical Software | 27 |
Multiagent Systems | 85 |
Multimedia | 76 |
Networking and Internet Architecture | 864 |
Neural and Evolutionary Computing | 235 |
Numerical Analysis | 40 |
Operating Systems | 36 |
Other Computer Science | 67 |
Performance | 45 |
Programming Languages | 268 |
Robotics | 917 |
Social and Information Networks | 202 |
Software Engineering | 659 |
Sound | 7 |
Symbolic Computation | 44 |
Systems and Control | 415 |
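The category-by-year table above can also be produced with `pd.crosstab`, which counts rows directly and fills missing combinations with 0. The sketch below uses a four-row toy frame standing in for `cats`; the rows, including the 2020 entry, are illustrative.

```python
import pandas as pd

# Toy stand-in for cats (illustrative rows)
cats = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "year": [2019, 2019, 2019, 2020],
    "category_name": ["Robotics", "Robotics", "Databases", "Robotics"],
})

# Rows are categories, columns are years, cells are paper counts
table = pd.crosstab(cats["category_name"], cats["year"])

# Rank categories by their 2019 count, largest first
ranked = table.sort_values(2019, ascending=False)

print(ranked)
```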