Task3: Statistics on paper page counts, figures and tables, and code


1. Task description

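  • Task topic: paper code statistics, i.e. statistics on where code appears across all papers;
  • Task content: use regular expressions to count code links, page counts, and figure/table data;
  • Task outcome: learn how to compute statistics with regular expressions.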
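# Import the required packages
import seaborn as sns # plotting
from bs4 import BeautifulSoup # scraping arXiv data
import re # regular expressions, for matching string patterns
import requests # network access: sends HTTP requests and fetches data from a URL
import json # reading the data, which is in JSON format
import pandas as pd # data processing and analysis
import numpy as np
import matplotlib.pyplot as plt # plotting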

2. Reading the data

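def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file (one JSON record per line) into a DataFrame.
        path: relative path to the file
        columns: columns to keep
        count: number of lines to read (the original data has 170,000+ lines)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count: # idx starts at 0, so idx == count means this is already record count+1
                break
            # Parse one line: the raw sample is a dict that contains every column
            d = json.loads(line)
            # Keep only the requested columns with a dict comprehension (key = column name,
            # value = that column's value in the sample); for all columns, json.loads alone is enough
            d = {col: d[col] for col in columns}
            data.append(d)

    data = pd.DataFrame(data)
    return data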
data = readArxivFile(path="./data/arxiv-metadata-oai-2019.json",columns = ["id","abstract","categories","comments"])
data.head()
   id         abstract                                         categories  comments
0  0704.0297  We systematically explore the evolution of t...  astro-ph    15 pages, 15 figures, 3 tables, submitted to M...
1  0704.0342  Cofibrations are defined in the category of ...  math.AT     27 pages
2  0704.0360  We explore the effect of an inhomogeneous ma...  astro-ph    6 pages, 3 figures, accepted in A&A
3  0704.0525  This paper has been removed by arXiv adminis...  gr-qc       This submission has been withdrawn by arXiv ad...
4  0704.0535  The most massive elliptical galaxies show a ...  astro-ph    32 pages (referee format), 9 figures, ApJ acce...
'15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n  version; a high resolution version can be found at:\n  http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)'
for index,comment in enumerate(data["comments"].head(10)):
    print(index,comment) # the comments field may contain links to a paper's code
0 15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution
  version; a high resolution version can be found at:
  http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)
1 27 pages
2 6 pages, 3 figures, accepted in A&A
3 This submission has been withdrawn by arXiv administrators due to
  inappropriate text reuse from external sources
4 32 pages (referee format), 9 figures, ApJ accepted
5 8 pages, 13 figures
6 5 pages, pdf format
7 30 pages
8 6 pages, 4 figures, Submitted to Physical Review Letters
9 34 pages, 9 figures, accepted for publication in ApJ
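
The line-by-line reading used in this section can be tried without the full dataset. The following is a minimal sketch under stated assumptions: the three records, the file name sample.json, and the helper name read_json_lines are all made up for illustration; only the layout of one JSON object per line is taken from the text above.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up records in the JSON-lines layout (one JSON object per line).
sample_records = [
    {"id": "0001.0001", "title": "A", "comments": "15 pages, 3 figures"},
    {"id": "0001.0002", "title": "B", "comments": None},
    {"id": "0001.0003", "title": "C", "comments": "27 pages"},
]


def read_json_lines(path, columns, count=None):
    """Hypothetical helper: read at most `count` lines, keeping only `columns`."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if count is not None and idx >= count:
                break  # `count` lines have already been read
            record = json.loads(line)
            rows.append({col: record[col] for col in columns})
    return pd.DataFrame(rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")
    # Read only the first two lines and only two of the three fields.
    sample = read_json_lines(sample_path, columns=["id", "comments"], count=2)

print(sample)
```

A JSON null becomes a missing value in the DataFrame, which is why a comments column can report fewer non-null entries than rows.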

3. Counting paper pages: the pages entry in the comments field

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170618 entries, 0 to 170617
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          170618 non-null  object
 1   abstract    170618 non-null  object
 2   categories  170618 non-null  object
 3   comments    118104 non-null  object
dtypes: object(4)
memory usage: 5.2+ MB

3.1 re.findall(pattern, string, flags=0)

Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.

If the pattern does not match anywhere in the string, an empty list is returned.
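
The effect of capturing groups on the return value is easiest to see side by side. This is a small illustrative sketch; the sample string and the variable names are made up, and only re.findall itself comes from the text above.

```python
import re

text = "10 pages, 11 figures, 20 pages"

# No capturing group: each element is the whole matched substring.
no_group = re.findall(r"[1-9][0-9]* pages", text)

# One capturing group: each element is only the text captured by that group.
one_group = re.findall(r"([1-9][0-9]*) pages", text)

# Two capturing groups: each element is a tuple with one entry per group.
two_groups = re.findall(r"([1-9][0-9]*) (pages|figures)", text)

# No match at all: the result is an empty list, not None.
no_match = re.findall(r"[1-9][0-9]* tables", text)

print(no_group)    # ['10 pages', '20 pages']
print(one_group)   # ['10', '20']
print(two_groups)  # [('10', 'pages'), ('11', 'figures'), ('20', 'pages')]
print(no_match)    # []
```

Because an empty list is falsy, a plain `if matches:` test is enough to tell whether anything was found.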

pages_pattern = "[1-9][0-9]* pages"
re.findall(pages_pattern,"10 pages,11 figures,20 pages") # when findall finds several matches, all of them go into the list, so its length is no longer 1
['10 pages', '20 pages']
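pages_pattern = "[1-9][0-9]* pages" # match a number with at least one digit, followed by " pages"
# The first digit must be in the range 1-9; any further digit can be 0-9. [0-9]* matches a digit 0-9 zero or more times (zero times: a one-digit number; more times: a number with two or more digits)
re.findall(pages_pattern, data["comments"][0]) # the first comment in the dataset
['15 pages']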
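
A few edge cases show what this pattern accepts and what it skips. The input strings below are made up for illustration. Note that a leading zero is skipped rather than rejected, and that the singular "page", a capitalized "Pages", and a missing space are not matched.

```python
import re

pages_pattern = "[1-9][0-9]* pages"

# Made-up inputs chosen to probe the edges of the pattern.
samples = ["120 pages", "0 pages", "05 pages", "1 page", "15 Pages", "12pages"]

results = {s: re.findall(pages_pattern, s) for s in samples}
for s, found in results.items():
    print(repr(s), found)
# '120 pages' ['120 pages']
# '0 pages' []
# '05 pages' ['5 pages']
# '1 page' []
# '15 Pages' []
# '12pages' []
```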
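data.comments.apply(lambda x:re.findall(pages_pattern,str(x))).head(10) # str() is needed here: the column has object dtype, but missing comments are NaN (a float), which re.findall rejects with a TypeError
# only the first 10 rows are shown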
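
The role of str() and one possible way to turn the matches into numbers can be sketched on a small made-up Series. The four sample comments and the names matches, has_pages, and page_counts are assumptions for illustration, not part of the original notebook.

```python
import re

import numpy as np
import pandas as pd

pages_pattern = "[1-9][0-9]* pages"

# Made-up comments, including a missing value like those in the real column.
comments = pd.Series(["15 pages, 15 figures", np.nan, "6 pages, 3 figures", "pdf format"])

# Without str(), re.findall raises TypeError on the NaN entry (a float).
try:
    comments.apply(lambda x: re.findall(pages_pattern, x))
    raised = False
except TypeError:
    raised = True

# With str(), NaN becomes the string "nan", which simply yields an empty list.
matches = comments.apply(lambda x: re.findall(pages_pattern, str(x)))

# Keep rows with at least one match and turn the first match into an integer.
has_pages = matches[matches.apply(len) > 0]
page_counts = has_pages.apply(lambda m: int(m[0].split(" ")[0]))

print(matches.tolist())      # [['15 pages'], [], ['6 pages'], []]
print(page_counts.tolist())  # [15, 6]
```

Taking only the first match is a simplifying choice; a comment that mentions pages more than once would need a rule for which number to keep.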

