Day01 - Data Analysis in Practice - Paper Count Statistics (DataWhale)

Article link: https://blog.csdn.net/liying_tt/article/details/112597789

I. Paper Count Statistics

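Count the number of papers in each area of computer science over the whole of 2019.

Steps:

1. Select the records whose update_date falls in 2019

2. Select the records whose categories belong to computer science

3. Count them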
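
The three steps can be sketched on a handful of hand-made records before touching the full dataset. This is a minimal sketch in plain Python, assuming each record has an `update_date` string in `YYYY-MM-DD` form and a space-separated `categories` string, and that computer science categories are the ones starting with `cs.`. The sample records and the helper name `count_cs_papers` are invented for illustration.

```python
from collections import Counter

# Invented sample records; only the two fields used here are included.
records = [
    {"update_date": "2019-03-01", "categories": "cs.CV cs.LG"},
    {"update_date": "2019-07-15", "categories": "math.CO cs.CG"},
    {"update_date": "2019-11-30", "categories": "hep-ph"},
    {"update_date": "2008-12-13", "categories": "cs.CV"},
]


def count_cs_papers(records, year):
    """Count papers per cs.* category among records updated in `year`."""
    counts = Counter()
    for rec in records:
        # Step 1: keep records whose update_date falls in the requested year.
        if not rec["update_date"].startswith(f"{year}-"):
            continue
        # Step 2: keep only the computer science categories of the paper.
        cs_cats = [c for c in rec["categories"].split() if c.startswith("cs.")]
        # Step 3: count the paper once under each of its cs.* categories.
        counts.update(cs_cats)
    return counts


cs_counts = count_cs_papers(records, 2019)
print(cs_counts)
```

A paper with several cs.* categories is counted once under each of them, so the per-category counts can add up to more than the number of papers.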

1. Reading the raw data

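# Import packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # parsing scraped HTML
import re  # regular expressions for matching string patterns
import requests  # sending HTTP requests to fetch a page by its URL
import json  # reading JSON data
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting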

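data = []  # start with an empty list
# Benefits of the with statement: 1. the file handle is closed automatically;
# 2. it is closed even if an exception is raised while reading the file
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for line in f:
        data.append(json.loads(line))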
        
'''
    for idx, line in enumerate(f):
        # Read only the first 100 lines: when just inspecting the data there is
        # no need to load everything, so take care with this step
        if idx >= 100:
            break
'''
data = pd.DataFrame(data)  # convert the list to a DataFrame for easier analysis
data.shape  # show the size of the data

(1796911, 14)
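
Because the full file holds about 1.8 million lines, it helps to load only the first few records while exploring. The following is a minimal sketch, assuming the file contains one JSON object per line with no blank lines. The helper name `read_json_lines` and its `limit` parameter are hypothetical, and `itertools.islice` stands in for a manual `enumerate`/`break` counter.

```python
import json
from itertools import islice


def read_json_lines(path, limit=None):
    """Return a list of dicts parsed from a file with one JSON object per line.

    `limit` caps the number of lines read; None reads the whole file.
    """
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after `limit` lines, so the rest of the file is never read.
        return [json.loads(line) for line in islice(f, limit)]
```

For example, `read_json_lines("arxiv-metadata-oai-snapshot.json", limit=100)` would return the first 100 records, which can then be passed to `pd.DataFrame` in the same way as the full list.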

data.head(2)

	id	submitter	authors	title	comments	journal-ref	doi	report-no	categories	license	abstract	versions	update_date	authors_parsed
0	0704.0001	Pavel Nadolsky	C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...	Calculation of prompt diphoton production cros...	37 pages, 15 figures; published version	Phys.Rev.D76:013009,2007	10.1103/PhysRevD.76.013009	ANL-HEP-PR-07-12	hep-ph	None	A fully differential calculation in perturba...	[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...	2008-11-26	[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1	0704.0002	Louis Theran	Ileana Streinu and Louis Theran	Sparsity-certifying Graph Decompositions	To appear in Graphs and Combinatorics	None	None	None	math.CO cs.CG	http://arxiv.org/licenses/nonexclusive-distrib...	We describe a new algorithm, the $(k,\ell)$-...	[{'version': 'v1', 'created': 'Sat, 31 Mar 200...	2008-12-13	[[Streinu, Ileana, ], [Theran, Louis, ]]

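Field descriptions for the dataset:

id: arXiv ID, which can be used to access the paper;
submitter: the person who submitted the paper;
authors: the paper's authors;
title: the paper's title;
comments: additional information such as page and figure counts;
journal-ref: information about the journal in which the paper was published;
doi: Digital Object Identifier, https://www.doi.org;
report-no: report number;
categories: the categories or tags the paper belongs to in the arXiv system;
license: the license of the article;
abstract: the paper's abstract;
versions: the paper's versions;
update_date: the date the paper was last updated;
authors_parsed: the author names in parsed form.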

2. Data preprocessing

First look at the paper category information to get a basic picture of the dataset.

data['categories'].describe()

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

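- count: number of (non-null) elements;

- unique: number of distinct values;

- top: the most frequent value;

- freq: number of occurrences of the most frequent value.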
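
To make the four numbers concrete, they can be reproduced by hand on a small list. This is a sketch with `collections.Counter` on invented sample values. It mirrors what `describe()` reports for a column of strings, but it is not how pandas computes the summary internally.

```python
from collections import Counter

# Invented sample of category strings, one per paper.
categories = ["astro-ph", "hep-ph", "astro-ph", "math.CO cs.CG", "astro-ph"]

value_counts = Counter(categories)
top, freq = value_counts.most_common(1)[0]

summary = {
    "count": len(categories),     # number of elements
    "unique": len(value_counts),  # number of distinct values
    "top": top,                   # most frequent value
    "freq": freq,                 # occurrences of the most frequent value
}
print(summary)
```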

data['categories'].head(4)

0            hep-ph
1     math.CO cs.CG
2    physics.gen-ph
3           math.CO
Name: categories, dtype: object
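
One paper can carry several space-separated categories (row 1 is `math.CO cs.CG`), so the `unique` figure reported by `describe()` counts distinct combinations as stored, not distinct individual categories. A small sketch of the distinction, using only the four values printed above; the variable names are chosen for illustration.

```python
# The four category strings printed by data['categories'].head(4).
head = ["hep-ph", "math.CO cs.CG", "physics.gen-ph", "math.CO"]

# Distinct strings as stored: each combination counts as its own value.
distinct_combinations = set(head)

# Distinct individual categories after splitting each string on whitespace.
distinct_categories = {cat for entry in head for cat in entry.split()}

print(sorted(distinct_combinations))
print(sorted(distinct_categories))
```

Here `math.CO cs.CG` is one stored value but contributes two individual categories, `math.CO` and `cs.CG`, which is why the strings need splitting before computer science papers can be picked out.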

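Next, look at how the categories are organized. They also need to be arranged according to the official arXiv paper categories so that the computer science records are easy to find.

Scraping the category data from the official website

# Fetch the text of the web page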
websit_url