Chapter 1: Paper Data Statistics

Latest recommended article published 2021-01-27 00:14:59

减肥的卡比兽

Column: Datawhale data analysis learning. Tags: data analysis

Article link: https://blog.csdn.net/zzj960321/article/details/112598331

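This post shows how to run a statistical analysis of paper data, covering data preprocessing, category statistics, and filtering by time features. It focuses on computer science papers from 2019 onward: the Computer Vision and Pattern Recognition subcategory has the most papers, and its count grows year over year. It also covers other subfields such as Computation and Language, Cryptography and Security, and Robotics.

Summary generated automatically by CSDN.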

Import packages and read the raw data

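# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # scraping arXiv data
import re  # regular expressions, for matching string patterns
import requests  # network access: send HTTP requests and fetch data by URL
import json  # read the data, which is stored as JSON
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting

# Read the data
data = []

# A with statement closes the file handle automatically, even if an exception is raised while reading
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for idx, line in enumerate(f):

        # Read only the first 1000 lines; loading the full dataset needs about 8 GB of memory
        if idx >= 1000:
            break

        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame so it can be analysed with pandas
data.shape  # show the size of the data
# (1000, 14)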
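
The loop above stops after a fixed number of lines so that the full snapshot never has to fit in memory. The same idea can be sketched as a reusable helper built on `itertools.islice`. The function name `read_json_lines` and its `limit` parameter are assumptions made for illustration and do not appear in the original post.

```python
import json
from itertools import islice


def read_json_lines(path, limit=None):
    """Return the first `limit` records of a JSON Lines file as a list of dicts.

    Hypothetical helper (not from the original post); limit=None reads the whole file.
    """
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so that json.loads never sees an empty string
        non_empty = (line for line in f if line.strip())
        # islice stops after `limit` records without reading the rest of the file
        return [json.loads(line) for line in islice(non_empty, limit)]
```

The returned list can be passed to `pd.DataFrame(...)` in the same way as the `data` list above.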

data.head()  # show the first five rows of the data

id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	update_date	authors_parsed
0	0704.0001	Pavel Nadolsky	C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-…	Calculation of prompt diphoton production cros…	37 pages, 15 figures; published version	Phys.Rev.D76:013009,2007	10.1103/PhysRevD.76.013009	ANL-HEP-PR-07-12	hep-ph	None	A fully differential calculation in perturba…	[{'version': 'v1', 'created': 'Mon, 2 Apr 2007…	2008-11-26	[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,…
1	0704.0002	Louis Theran	Ileana Streinu and Louis Theran	Sparsity-certifying Graph Decompositions	To appear in Graphs and Combinatorics	None	None	None	math.CO cs.CG	http://arxiv.org/licenses/nonexclusive-distrib…	We describe a new algorithm, the $(k,\ell)$-…	[{'version': 'v1', 'created': 'Sat, 31 Mar 200…	2008-12-13	[[Streinu, Ileana, ], [Theran, Louis, ]]
2	0704.0003	Hongjun Pan	Hongjun Pan	The evolution of the Earth-Moon system based o…	23 pages, 3 figures	None	None	None	physics.gen-ph	None	The evolution of Earth-Moon system is descri…	[{'version': 'v1', 'created': 'Sun, 1 Apr 2007…	2008-01-13	[[Pan, Hongjun, ]]
3	0704.0004	David Callan	David Callan	A determinant of Stirling cycle numbers counts…	11 pages	None	None	None	math.CO	None	We show that a determinant of Stirling cycle…	[{'version': 'v1', 'created': 'Sat, 31 Mar 200…	2007-05-23	[[Callan, David, ]]
4	0704.0005	Alberto Torchinsky	Wael Abu-Shammala and Alberto Torchinsky	From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a…	None	Illinois J. Math. 52 (2008) no.2, 681-689	None	None	math.CA math.FA	None
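
The `categories` column in the preview above holds one or more category codes separated by spaces (for example `math.CO cs.CG`), which is what makes per-category counting possible. A minimal sketch of such a count follows, using only the `id` and `categories` values copied from the five preview rows. The function name `count_categories` is an assumption for illustration, and this is not necessarily how the original post computes its statistics.

```python
from collections import Counter

# Sample records copied from the preview rows above (id and categories only)
records = [
    {"id": "0704.0001", "categories": "hep-ph"},
    {"id": "0704.0002", "categories": "math.CO cs.CG"},
    {"id": "0704.0003", "categories": "physics.gen-ph"},
    {"id": "0704.0004", "categories": "math.CO"},
    {"id": "0704.0005", "categories": "math.CA math.FA"},
]


def count_categories(rows):
    """Count how many papers carry each category code (hypothetical helper)."""
    counts = Counter()
    for row in rows:
        # A paper can belong to several categories, separated by spaces
        counts.update(row["categories"].split())
    return counts


category_counts = count_categories(records)
print(category_counts.most_common(1))  # [('math.CO', 2)]
```

Because a paper can be cross-listed, the counts sum to more than the number of papers: the five rows here yield seven category assignments.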

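def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
        path: path to the file
        columns: columns to keep
        count: number of lines to read (None reads the whole file)
    '''

    # NOTE: the source text is cut off after "data  =". The rest of this body is an
    # assumed completion that mirrors the reading loop shown earlier.
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break

            d = json.loads(line)
            data.append({col: d[col] for col in columns})  # keep only the requested columns

    return pd.DataFrame(data)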
