1. Task description
- Topic: code statistics for papers — compute statistics on code appearing across all papers;
- Content: use regular expressions to count code links, page counts, and figure/table data;
- Goal: learn how to compute statistics with regular expressions.
# Import the required packages
import seaborn as sns               # plotting
from bs4 import BeautifulSoup       # parsing pages crawled from arXiv
import re                           # regular expressions, for matching string patterns
import requests                     # network requests, fetching data by URL
import json                         # reading our data, which is in JSON format
import pandas as pd                 # data processing and analysis
import numpy as np
import matplotlib.pyplot as plt     # plotting
2. Data loading
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
    path: relative path to the file
    columns: columns to keep
    count: number of rows to read (the raw data has 170k+ rows)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:  # indices start at 0, so reaching idx == count means count rows have been read
                break
            # Parse one line of data
            d = json.loads(line)  # the raw sample: a dict containing all columns
            d = {col: d[col] for col in columns}  # dict comprehension keeping only the requested columns (skip this step if you want them all)
            data.append(d)
    data = pd.DataFrame(data)
    return data
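As a sketch of an alternative, pandas can read JSON-lines files directly; the helper name `read_arxiv_pandas` below is illustrative, and `nrows` with `lines=True` assumes pandas >= 1.1:

```python
import pandas as pd

def read_arxiv_pandas(path, columns=None, count=None):
    """Load a JSON-lines arXiv dump with pandas instead of a manual loop."""
    df = pd.read_json(path, lines=True, nrows=count)  # nrows only works with lines=True
    if columns is not None:
        df = df[columns]  # keep only the requested columns
    return df
```

This trades the explicit loop for a single call, at the cost of parsing every column of each line before filtering.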
data = readArxivFile(path="./data/arxiv-metadata-oai-2019.json", columns=["id", "abstract", "categories", "comments"])
data.head()
| | id | abstract | categories | comments |
|---|---|---|---|---|
| 0 | 0704.0297 | We systematically explore the evolution of t... | astro-ph | 15 pages, 15 figures, 3 tables, submitted to M... |
| 1 | 0704.0342 | Cofibrations are defined in the category of ... | math.AT | 27 pages |
| 2 | 0704.0360 | We explore the effect of an inhomogeneous ma... | astro-ph | 6 pages, 3 figures, accepted in A&A |
| 3 | 0704.0525 | This paper has been removed by arXiv adminis... | gr-qc | This submission has been withdrawn by arXiv ad... |
| 4 | 0704.0535 | The most massive elliptical galaxies show a ... | astro-ph | 32 pages (referee format), 9 figures, ApJ acce... |
data["comments"][0]
'15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n version; a high resolution version can be found at:\n http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)'
for index, comment in enumerate(data["comments"].head(10)):
    print(index, comment)  # some comments contain links to the paper's code
0 15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution
version; a high resolution version can be found at:
http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)
1 27 pages
2 6 pages, 3 figures, accepted in A&A
3 This submission has been withdrawn by arXiv administrators due to
inappropriate text reuse from external sources
4 32 pages (referee format), 9 figures, ApJ accepted
5 8 pages, 13 figures
6 5 pages, pdf format
7 30 pages
8 6 pages, 4 figures, Submitted to Physical Review Letters
9 34 pages, 9 figures, accepted for publication in ApJ
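Since the task includes counting code links, here is a hedged sketch of pulling URLs out of a comment string; the pattern below is an illustrative assumption (scheme followed by non-whitespace), not the task's official regex:

```python
import re

# A deliberately simple URL pattern: http or https, then any run of non-whitespace.
url_pattern = r"https?://\S+"

comment = ("15 pages, 15 figures, 3 tables, submitted to MNRAS (Low resolution\n"
           " version; a high resolution version can be found at:\n"
           " http://www.astro.uva.nl/~scyoon/papers/wdmerger.pdf)")

# \S+ swallows trailing punctuation such as a closing parenthesis, so strip it off.
links = [m.rstrip(").,;") for m in re.findall(url_pattern, comment)]
```

Applying the same pattern over the whole `comments` column would give a per-paper link count.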
3. Counting paper pages – the "pages" token in the comments field
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170618 entries, 0 to 170617
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 170618 non-null object
1 abstract 170618 non-null object
2 categories 170618 non-null object
3 comments 118104 non-null object
dtypes: object(4)
memory usage: 5.2+ MB
3.1 re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group
If the pattern is not found in the string, an empty list is returned.
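The three return shapes described above can be checked directly; this small demo uses the same page-count pattern plus variants with capturing groups:

```python
import re

# No groups: findall returns the full matched strings.
assert re.findall(r"[1-9][0-9]* pages", "10 pages, 3 figures") == ["10 pages"]

# One group: findall returns only the group's contents.
assert re.findall(r"([1-9][0-9]*) pages", "10 pages, 3 figures") == ["10"]

# Two groups: findall returns a list of tuples, one tuple per match.
assert re.findall(r"([1-9][0-9]*) (pages|figures)",
                  "10 pages, 3 figures") == [("10", "pages"), ("3", "figures")]

# No match: an empty list.
assert re.findall(r"[1-9][0-9]* tables", "10 pages") == []
```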
pages_pattern = "[1-9][0-9]* pages"
re.findall(pages_pattern, "10 pages,11 figures,20 pages")  # with multiple matches, findall returns them all, so len can exceed 1
['10 pages', '20 pages']
pages_pattern = "[1-9][0-9]* pages"  # match a number with at least one digit
# The first digit must be 1-9 (no leading zero); any later digits can be 0-9.
# [0-9]* matches zero or more digits (zero: a one-digit number; more: two or more digits).
re.findall(pages_pattern,data["comments"][0])
['15 pages']
data.comments.apply(lambda x: re.findall(pages_pattern, str(x))).head(10)  # str() guards against missing comments: NaN is a float, which re.findall would reject
# show only the first 10 rows