Data Analysis Tells You How to Submit AI Papers

Understanding the JSON data

Import the libraries

import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests 
import json
import pandas as pd
import matplotlib.pyplot as plt

Load the data

data = []

# the file stores one JSON object per line
with open('./arxiv-metadata-oai-2019.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
data[0]
{'id': '0704.0297',
 'submitter': 'Sung-Chul Yoon',
 'authors': 'Sung-Chul Yoon, Philipp Podsiadlowski and Stephan Rosswog',
 'title': 'Remnant evolution after a carbon-oxygen white dwarf merger',
 'comments': '15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n  version; a high resolution version can be found at:\n  http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)',
 'journal-ref': None,
 'doi': '10.1111/j.1365-2966.2007.12161.x',
 'report-no': None,
 'categories': 'astro-ph',
 'license': None,
 'abstract': '  We systematically explore the evolution of the merger of two carbon-oxygen\n(CO) white dwarfs. The dynamical evolution of a 0.9 Msun + 0.6 Msun CO white\ndwarf merger is followed by a three-dimensional SPH simulation. We use an\nelaborate prescription in which artificial viscosity is essentially absent,\nunless a shock is detected, and a much larger number of SPH particles than\nearlier calculations. Based on this simulation, we suggest that the central\nregion of the merger remnant can, once it has reached quasi-static equilibrium,\nbe approximated as a differentially rotating CO star, which consists of a\nslowly rotating cold core and a rapidly rotating hot envelope surrounded by a\ncentrifugally supported disc. We construct a model of the CO remnant that\nmimics the results of the SPH simulation using a one-dimensional hydrodynamic\nstellar evolution code and then follow its secular evolution. The stellar\nevolution models indicate that the growth of the cold core is controlled by\nneutrino cooling at the interface between the core and the hot envelope, and\nthat carbon ignition in the envelope can be avoided despite high effective\naccretion rates. This result suggests that the assumption of forced accretion\nof cold matter that was adopted in previous studies of the evolution of double\nCO white dwarf merger remnants may not be appropriate. Our results imply that\nat least some products of double CO white dwarfs merger may be considered good\ncandidates for the progenitors of Type Ia supernovae. In this case, the\ncharacteristic time delay between the initial dynamical merger and the eventual\nexplosion would be ~10^5 yr. (Abridged).\n',
 'versions': [{'version': 'v1', 'created': 'Tue, 3 Apr 2007 01:50:26 GMT'},
  {'version': 'v2', 'created': 'Wed, 4 Apr 2007 17:28:44 GMT'}],
 'update_date': '2019-08-19',
 'authors_parsed': [['Yoon', 'Sung-Chul', ''],
  ['Podsiadlowski', 'Philipp', ''],
  ['Rosswog', 'Stephan', '']]}
data = pd.DataFrame(data)
data.shape
(170618, 14)
data.head()

|  | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... |
| 1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Dugmore, B., ], [Ntumba, PP., ]] |
| 2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | 2019-08-19 | [[Zaqarashvili, T. V., ], [Murawski, K, ]] |
| 3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-10-21 | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... |
| 4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | 2019-08-19 | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... |
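
The metadata file holds one JSON object per line, which is why the loading loop calls json.loads on each line instead of on the whole file. Below is a minimal sketch of the same pattern on an in-memory sample that keeps only a few fields to save memory. The two sample records and the `load_json_lines` helper are hypothetical and are not part of the original notebook.

```python
import io
import json

import pandas as pd


def load_json_lines(handle, keep=("id", "categories", "update_date")):
    """Parse one JSON object per line and keep only the fields in `keep`."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        rows.append({key: record.get(key) for key in keep})
    return pd.DataFrame(rows, columns=list(keep))


# Hypothetical two-record sample standing in for arxiv-metadata-oai-2019.json
sample = io.StringIO(
    '{"id": "0704.0297", "categories": "astro-ph", "update_date": "2019-08-19", "title": "A"}\n'
    '{"id": "0704.0342", "categories": "math.AT", "update_date": "2019-08-19", "title": "B"}\n'
)
df_sample = load_json_lines(sample)
print(df_sample.shape)  # (2, 3)
```

To use it on the real file, pass the open file handle in place of the StringIO object.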

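Data preprocessing

count: the number of non-null values in a column

unique: the number of distinct values in a column

top: the most frequent value in a column

freq: the number of times the most frequent value occurs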
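
A small sketch of how these four fields come out of `describe()` on a text column. The toy Series is hypothetical and is used only to make the four values easy to check by hand.

```python
import pandas as pd

# Hypothetical toy column: 5 values, 3 distinct, "astro-ph" appears most often
s = pd.Series(
    ["astro-ph", "math.AT", "astro-ph", "gr-qc", "astro-ph"],
    name="categories",
)
summary = s.describe()
print(summary)
```

Here count is 5, unique is 3, top is astro-ph, and freq is 3.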

# data.describe()
# data['categories'].describe()
count      1778381
unique       61371
top       astro-ph
freq         86914
Name: categories, dtype: object

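A paper can list several categories in one space-separated string, so counting the raw strings hides most of the individual categories. Splitting each string on spaces recovers the distinct category codes.

unique_categories = set([i for l in [x.split(' ') for x in data['categories']] for i in l])  # the inner comprehension splits each string; the outer one flattens the resulting lists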
len(unique_categories)
172
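
The splitting step can be sketched on a few rows. The toy DataFrame below is hypothetical; the `str.split` plus `explode` variant is an equivalent pandas idiom offered as an alternative, not the code used in this post.

```python
import pandas as pd

# Hypothetical toy data: the second paper is cross-listed in two categories
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "categories": ["astro-ph", "solv-int nlin.SI", "astro-ph"],
})

# Same idea as the set comprehension: split each string, then flatten
unique_categories = {code for cats in toy["categories"] for code in cats.split(" ")}

# Equivalent pandas idiom: one row per (paper, category) pair, then count
per_category = toy["categories"].str.split(" ").explode().value_counts()

print(sorted(unique_categories))
print(per_category)
```

The `value_counts()` result also gives the number of papers per individual category, which the set alone does not.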
The computer science categories and their names:

'cs.AI': 'Artificial Intelligence',

'cs.AR': 'Hardware Architecture',

'cs.CC': 'Computational Complexity',

'cs.CE': 'Computational Engineering, Finance, and Science',

'cs.CG': 'Computational Geometry',

'cs.CL': 'Computation and Language',

'cs.CR': 'Cryptography and Security',

'cs.CV': 'Computer Vision and Pattern Recognition',

'cs.CY': 'Computers and Society',

'cs.DB': 'Databases',

'cs.DC': 'Distributed, Parallel, and Cluster Computing',

'cs.DL': 'Digital Libraries',

'cs.DM': 'Discrete Mathematics',

'cs.DS': 'Data Structures and Algorithms',

'cs.ET': 'Emerging Technologies',

'cs.FL': 'Formal Languages and Automata Theory',

'cs.GL': 'General Literature',

'cs.GR': 'Graphics',

'cs.GT': 'Computer Science and Game Theory',

'cs.HC': 'Human-Computer Interaction',

'cs.IR': 'Information Retrieval',

'cs.IT': 'Information Theory',

'cs.LG': 'Machine Learning',

'cs.LO': 'Logic in Computer Science',

'cs.MA': 'Multiagent Systems',

'cs.MM': 'Multimedia',

'cs.MS': 'Mathematical Software',

'cs.NA': 'Numerical Analysis',

'cs.NE': 'Neural and Evolutionary Computing',

'cs.NI': 'Networking and Internet Architecture',

'cs.OH': 'Other Computer Science',

'cs.OS': 'Operating Systems',

'cs.PF': 'Performance',

'cs.PL': 'Programming Languages',

'cs.RO': 'Robotics',

'cs.SC': 'Symbolic Computation',

'cs.SD': 'Sound',

'cs.SE': 'Software Engineering',

'cs.SI': 'Social and Information Networks',

'cs.SY': 'Systems and Control',

We analyze papers from 2019 onward, so we first preprocess the time feature.

data['year'] = pd.to_datetime(data['update_date']).dt.year  # parse update_date as a datetime, then extract the year
data.tail(1)

|  | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 170617 | solv-int/9909014 | David Fairlie | D.B. Fairlie and A.N. Leznov | The General Solution of the Complex Monge-Amp\... | 13 pages, latex, no figures | None | 10.1088/0305-4470/33/25/307 | None | solv-int nlin.SI | None | A general solution to the Complex Monge-Amp\... | [{'version': 'v1', 'created': 'Thu, 16 Sep 199... | 2019-08-21 | [[Fairlie, D. B., ], [Leznov, A. N., ]] | 2019 |
# drop the update_date column
del data['update_date']
# keep only papers whose year is 2019 or later
data = data[data['year'] >= 2019]
data.head(1)

|  | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019 |
# renumber the rows from 0
data.reset_index(drop=True, inplace=True)
data.head(1)

|  | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019 |
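
The time preprocessing (extract the year, filter, drop the date column, reset the index) can be sketched on a toy frame. The three rows below are hypothetical. Note that `update_date` is the date a record was last updated, which is why a paper first created in 2007 can still carry year 2019 in the tables above.

```python
import pandas as pd

# Hypothetical toy records; only the last two were updated in 2019
toy = pd.DataFrame({
    "id": ["x", "y", "z"],
    "update_date": ["2018-12-31", "2019-08-19", "2019-10-21"],
})

# Parse the date strings and keep only the year component
toy["year"] = pd.to_datetime(toy["update_date"]).dt.year

# Filter to 2019 or later, drop the raw date column, renumber rows from 0
recent = (
    toy[toy["year"] >= 2019]
    .drop(columns="update_date")
    .reset_index(drop=True)
)
print(recent)
```

Chaining `drop` and `reset_index` returns a new frame instead of modifying `toy` in place, which avoids pandas warnings about writing to a filtered slice.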

Find the computer science papers among the 2019 papers

website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the page's HTML as text
website_url[0:100]
'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN'
soup = BeautifulSoup(website_url, 'lxml')  # parse the raw text into a navigable HTML tree
root = soup.find('div', {'id': 'category_taxonomy_list'})  # locate the element that holds the category list
tags = root.find_all(['h2', 'h3', 'h4', 'p'], recursive=True)  # collect the heading and paragraph tags
# initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
# walk the tags in document order and record one row per category description
for t in tags:
    if t.name == "h2":
        level_1_name = t.text    
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # regex: pattern (.*)\((.*)\); replacement "\2" (the second group); input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
# build a DataFrame from the collected lists
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
    
})

# groupby() returns a new GroupBy object; its result is not assigned, so this line leaves df_taxonomy unchanged
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy.head()

|  | group_name | archive_name | archive_id | category_name | categories | category_description |
|---|---|---|---|---|---|---|
| 0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics... |
| 1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi... |
| 2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class... |
| 3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the... |
| 4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class... |
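
The two `re.sub` patterns in the parsing loop use backreferences to pull a code and a name out of a heading of the form "text (text)". A minimal sketch on two hypothetical heading strings (the real page text may differ):

```python
import re

# Hypothetical heading texts in the shape the loop expects
h3_text = "Astrophysics (astro-ph)"
h4_text = "cs.AI (Artificial Intelligence)"

# h3 pattern: group 1 is everything before "(", group 2 is the text inside "(...)"
archive_id = re.sub(r"(.*)\((.*)\)", r"\2", h3_text)
archive_name = re.sub(r"(.*)\((.*)\)", r"\1", h3_text)

# h4 pattern: the literal space before "(" is consumed, so group 1 has no trailing space
category_code = re.sub(r"(.*) \((.*)\)", r"\1", h4_text)
category_name = re.sub(r"(.*) \((.*)\)", r"\2", h4_text)

print(repr(archive_id), repr(archive_name))
print(repr(category_code), repr(category_name))
```

Because the h3 pattern has no space before the parenthesis, `archive_name` keeps a trailing space; calling `.strip()` on it is a simple way to clean that up.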

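Data analysis and visualization

# left-join the taxonomy on the categories string, keep one row per (paper, group), then count papers per group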
_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df

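|  | group_name | id |
|---|---|---|
| 0 | Physics | 38379 |
| 1 | Mathematics | 24495 |
| 2 | Computer Science | 18087 |
| 3 | Statistics | 1802 |
| 4 | Electrical Engineering and Systems Science | 1371 |
| 5 | Quantitative Biology | 886 |
| 6 | Quantitative Finance | 352 |
| 7 | Economics | 173 |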
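
The merge matches the full `categories` string against a single taxonomy code, so cross-listed papers (for example `solv-int nlin.SI`) find no taxonomy row and drop out of the per-group counts. Below is a sketch, on hypothetical toy data, of one possible alternative that splits the string first so each paper is counted once in every group it is listed in. This is an assumption about what one might want, not what the pipeline in this post does.

```python
import pandas as pd

# Hypothetical toy papers and a four-row stand-in for df_taxonomy
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI", "cs.CV cs.LG", "math.AT cs.AI"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CV", "cs.LG", "math.AT"],
    "group_name": ["Computer Science"] * 3 + ["Mathematics"],
})

# Exact-string merge: only p1 finds a group, p2 and p3 get NaN
exact = papers.merge(taxonomy, on="categories", how="left")

# Alternative: one row per (paper, category code) before merging
exploded = papers.assign(
    categories=papers["categories"].str.split(" ")
).explode("categories")

counts = (
    exploded.merge(taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])  # count a paper once per group
    .groupby("group_name")
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)
print(counts)
```

With this variant a paper listed in two groups is counted in both, so the group totals no longer sum to the number of papers.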
fig = plt.figure(figsize=(15,12))
explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
plt.pie(_df['id'], labels=_df['group_name'],  autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()

plt.show()
[Figure: pie chart of the share of papers in each group_name]
group_name = 'Computer Science'
cats = data.merge(df_taxonomy, on='categories').query('group_name == @group_name')
cats.groupby(['year','category_name']).count().reset_index().pivot(index='category_name', columns = 'year', values = 'id')

| category_name | year 2019 |
|---|---|
| Artificial Intelligence | 558 |
| Computation and Language | 2153 |
| Computational Complexity | 131 |
| Computational Engineering, Finance, and Science | 108 |
| Computational Geometry | 199 |
| Computer Science and Game Theory | 281 |
| Computer Vision and Pattern Recognition | 5559 |
| Computers and Society | 346 |
| Cryptography and Security | 1067 |
| Data Structures and Algorithms | 711 |
| Databases | 282 |
| Digital Libraries | 125 |
| Discrete Mathematics | 84 |
| Distributed, Parallel, and Cluster Computing | 715 |
| Emerging Technologies | 101 |
| Formal Languages and Automata Theory | 152 |
| General Literature | 5 |
| Graphics | 116 |
| Hardware Architecture | 95 |
| Human-Computer Interaction | 420 |
| Information Retrieval | 245 |
| Logic in Computer Science | 470 |
| Machine Learning | 177 |
| Mathematical Software | 27 |
| Multiagent Systems | 85 |
| Multimedia | 76 |
| Networking and Internet Architecture | 864 |
| Neural and Evolutionary Computing | 235 |
| Numerical Analysis | 40 |
| Operating Systems | 36 |
| Other Computer Science | 67 |
| Performance | 45 |
| Programming Languages | 268 |
| Robotics | 917 |
| Social and Information Networks | 202 |
| Software Engineering | 659 |
| Sound | 7 |
| Symbolic Computation | 44 |
| Systems and Control | 415 |
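
The group, count, and pivot step reshapes long per-paper rows into a category-by-year table. A minimal sketch on hypothetical toy data with two years, so the pivot produces more than one column; the `n_papers` column name is an illustrative choice, not a name from this post.

```python
import pandas as pd

# Hypothetical toy rows: one row per paper
cats = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "year": [2019, 2019, 2019, 2020, 2020],
    "category_name": ["Robotics", "Robotics", "Databases", "Robotics", "Databases"],
})

# size() counts rows per (year, category_name) pair without relying on a specific column
long_counts = (
    cats.groupby(["year", "category_name"])
    .size()
    .reset_index(name="n_papers")
)

# pivot: one row per category, one column per year
wide = long_counts.pivot(index="category_name", columns="year", values="n_papers")
print(wide)
```

Using `size()` avoids the dependence of `count()` on which column happens to be non-null.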
