1. Task description
- Topic: code statistics for papers — compute statistics on code appearing across all papers;
- Content: use regular expressions to count code links, page counts, and figure/table data;
- Goal: learn how to compute statistics with regular expressions.
# Import the required packages
import seaborn as sns               # plotting
from bs4 import BeautifulSoup       # parsing pages crawled from arXiv
import re                           # regular expressions, for matching string patterns
import requests                     # network requests, fetching data by URL
import json                         # reading our data, which is in JSON format
import pandas as pd                 # data processing and analysis
import numpy as np
import matplotlib.pyplot as plt     # plotting
2. Data loading
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
    path: relative path to the file
    columns: columns to keep
    count: number of rows to read (the raw data has 170k+ rows)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:  # indices start at 0, so reaching idx == count means count rows have been read
                break
            # Parse one line of data
            d = json.loads(line)  # the raw sample: a dict containing all columns
            d = {col: d[col] for col in columns}  # dict comprehension keeping only the requested columns (skip this step if you want them all)
            data.append(d)
    data = pd.DataFrame(data)
    return data
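As a sketch of an alternative, pandas can read JSON-lines files directly; the helper name `read_arxiv_pandas` below is illustrative, and `nrows` with `lines=True` assumes pandas >= 1.1:

```python
import pandas as pd

def read_arxiv_pandas(path, columns=None, count=None):
    """Load a JSON-lines arXiv dump with pandas instead of a manual loop."""
    df = pd.read_json(path, lines=True, nrows=count)  # nrows only works with lines=True
    if columns is not None:
        df = df[columns]  # keep only the requested columns
    return df
```

This trades the explicit loop for a single call, at the cost of parsing every column of each line before filtering.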
data = readArxivFile(path="./data/arxiv-metadata-oai-2019.json", columns=["id", "abstract", "categories", "comments"])
data.head()
| | id | abstract | categories | comments |
|---|---|---|---|---|
| 0 | 0704.0297 | We systematically explore the evolution of t... | astro-ph | 15 pages, 15 figures, 3 tables, submitted to M... |
| 1 | 0704.0342 | Cofibrations are defined in the category of ... | math.AT | 27 pages |
| 2 | 0704.0360 | We explore the effect of an inhomogeneous ma... | astro-ph | 6 pages, 3 figures, accepted in A&A |
| 3 | 0704.0525 | This paper has been removed by arXiv adminis... | gr-qc | This submission has been withdrawn by arXiv ad... |
| 4 | 0704.0535 | The most massive elliptical galaxies show a ... | astro-ph | 32 pages (referee format), 9 figures, ApJ acce... |
data["comments"][0]
'15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n version; a high resolution version can be found at:\n http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)'
for index, comment in enumerate(data["comments"].head(10)):
    print(index, comment)  # some comments contain links to the paper's code
0 15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution
version; a high resolution version can be found at:
http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)
1 27 pages
2 6 pages, 3 figures, accepted in A&A
3 This submission has been withdrawn by arXiv administrators due to
inappropriate text reuse from external sources
4 32 pages (referee format), 9 figures, ApJ accepted
5 8 pages, 13 figures
6 5 pages, pdf format
7 30 pages
8 6 pages, 4 figures, Submitted to Physical Review Letters
9 34 pages, 9 figures, accepted for publication in ApJ
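Since the task includes counting code links, here is a hedged sketch of pulling URLs out of a comment string; the pattern below is an illustrative assumption (scheme followed by non-whitespace), not the task's official regex:

```python
import re

# A deliberately simple URL pattern: http or https, then any run of non-whitespace.
url_pattern = r"https?://\S+"

comment = ("15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n"
           " version; a high resolution version can be found at:\n"
           " http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)")

# \S+ swallows trailing punctuation such as a closing parenthesis, so strip it off.
links = [m.rstrip(").,;") for m in re.findall(url_pattern, comment)]
```

Applying the same pattern over the whole `comments` column would give a per-paper link count.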
3. Counting paper pages – the "pages" token in the comments field
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170618 entries, 0 to 170617
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 170618 non-null object
1 abstract 170618 non-null object
2 categories 170618 non-null object
3 comments 118104 non-null object
dtypes: object(4)
memory usage: 5.2+ MB
3.1 re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group
If the pattern is not found in the string, an empty list is returned.
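The three return shapes described above can be checked directly; this small demo uses the same page-count pattern plus variants with capturing groups:

```python
import re

# No groups: findall returns the full matched strings.
assert re.findall(r"[1-9][0-9]* pages", "10 pages, 3 figures") == ["10 pages"]

# One group: findall returns only the group's contents.
assert re.findall(r"([1-9][0-9]*) pages", "10 pages, 3 figures") == ["10"]

# Two groups: findall returns a list of tuples, one tuple per match.
assert re.findall(r"([1-9][0-9]*) (pages|figures)",
                  "10 pages, 3 figures") == [("10", "pages"), ("3", "figures")]

# No match: an empty list.
assert re.findall(r"[1-9][0-9]* tables", "10 pages") == []
```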
pages_pattern = "[1-9][0-9]* pages"
re.findall(pages_pattern, "10 pages,11 figures,20 pages")  # with multiple matches, findall returns them all, so len can exceed 1
['10 pages', '20 pages']
pages_pattern = "[1-9][0-9]* pages"  # match a number with at least one digit
# The first digit must be 1-9 (no leading zero); any later digits can be 0-9.
# [0-9]* matches zero or more digits (zero: a one-digit number; more: two or more digits).
re.findall(pages_pattern,data["comments"][0])
['15 pages']
data.comments.apply(lambda x: re.findall(pages_pattern, str(x))).head(10)  # str() guards against missing comments: NaN is a float, which re.findall would reject
# show only the first 10 rows