2021-01-16

licoky

Article link: https://blog.csdn.net/weixin_46525585/article/details/112726964


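DataWhale Task 2: Paper Author Statistics
Task description

Task topic: paper author statistics
Task content: read the data with Pandas and apply string operations
Task result: become proficient with Pandas string operations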

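String handling

Method	Description
str.casefold()	Converts the whole string to lowercase, more aggressively than str.lower(), for caseless matching
str.capitalize()	Upper-cases the first character and lower-cases the rest
str.join()	Joins an iterable of strings, using the string as the separator
str.upper()	Converts lowercase characters to uppercase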
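Since the task is about Pandas, it may help to see the same four operations through the `Series.str` accessor. The following is a minimal sketch; the sample names are made up for illustration and do not come from the arXiv data.

```python
import pandas as pd

# Hypothetical sample names, not taken from the arXiv dataset.
names = pd.Series(["ada LOVELACE", "alan turing", "Grace Hopper"])

lowered = names.str.casefold()        # caseless-matching form of every element
capitalized = names.str.capitalize()  # first character upper, the rest lower
uppered = names.str.upper()           # every character upper-case

# Series.str.join works element-wise on lists of strings.
parts = pd.Series([["Lovelace", "Ada"], ["Turing", "Alan"]])
joined = parts.str.join(" ")

print(lowered.tolist())
print(capitalized.tolist())
print(uppered.tolist())
print(joined.tolist())
```

Each call returns a new Series and leaves `names` unchanged.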

Code

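import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # scraping arXiv data
import re  # regular expressions, for matching string patterns
import requests  # network requests, fetching data from a URL
import json  # reading the data, which is stored as JSON
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting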

def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
    path: file path
    columns: columns to keep
    count: number of lines to read (None reads the whole file)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            # Each line of the file is one JSON record
            d = json.loads(line)
            d = {col: d[col] for col in columns}
            data.append(d)
    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-snapshot.json', ['id', 'authors', 'categories', 'authors_parsed'], 100000)
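To try the reading logic without downloading the full snapshot, one option is to run it against a tiny JSON-lines file. The sketch below is an assumption-laden stand-in: `read_json_lines` is a hypothetical helper that mirrors the function above, and the three records are invented to mimic the shape of the metadata.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines(path, columns, count=None):
    """Read up to `count` JSON-lines records, keeping only `columns`."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if count is not None and idx >= count:
                break
            record = json.loads(line)
            rows.append({col: record[col] for col in columns})
    return pd.DataFrame(rows, columns=columns)


# Made-up records; only the field names follow the original code.
sample = [
    {"id": "0001", "categories": "cs.CV", "authors_parsed": [["Doe", "Jane", ""]]},
    {"id": "0002", "categories": "math.CO", "authors_parsed": [["Roe", "Rick", ""]]},
    {"id": "0003", "categories": "cs.CV cs.LG", "authors_parsed": [["Poe", "Ann", ""]]},
]

# Write one JSON object per line, the layout the reader expects.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

df = read_json_lines(tmp_path, ["id", "authors_parsed"], count=2)
os.remove(tmp_path)
print(df)
```

With `count=2` only the first two records are loaded, and only the requested columns are kept.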

# Select the papers whose categories include cs.CV
data2 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]

# Concatenate the author lists of all selected papers into one list
all_authors = sum(data2['authors_parsed'], [])

# Join the name parts of each author into a single string
authors_names = [' '.join(x) for x in all_authors]
authors_names = pd.DataFrame(authors_names)

# Plot a bar chart of the 10 most frequent authors
plt.figure(figsize=(10, 6))
authors_names[0].value_counts().head(10).plot(kind='barh')

# Adjust the plot: tick labels and axis labels
names = authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

[Figure: horizontal bar chart of the 10 most frequent author names (x-axis: Count, y-axis: Author)]
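A note on the flattening step: `sum(list_of_lists, [])` builds a new list on every addition, so it slows down quadratically as the number of papers grows. One alternative, shown here as a sketch and not as the approach used in this post, is `itertools.chain.from_iterable` together with `collections.Counter`. The author lists below are invented and only assume the list-of-name-parts layout that the code above relies on.

```python
from collections import Counter
from itertools import chain

# Made-up author lists; each author is a list of name parts.
authors_parsed = [
    [["Doe", "Jane", ""], ["Poe", "Ann", ""]],
    [["Doe", "Jane", ""]],
    [["Roe", "Rick", "Jr"]],
]

# Flatten one level (papers -> authors) in linear time.
all_authors = list(chain.from_iterable(authors_parsed))

# Join only the non-empty parts, which avoids a trailing space
# when a part such as the suffix is an empty string.
full_names = [" ".join(part for part in author if part) for author in all_authors]

# Count occurrences and take the most frequent name.
top = Counter(full_names).most_common(1)
print(top)
```

The resulting counts can be passed to `pd.Series` if the plotting code above should be reused.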

# Take the first name part of each author, which is the last name
authors_lastnames = [x[0] for x in all_authors]
authors_lastnames = pd.DataFrame(authors_lastnames)

# Plot a bar chart of the 10 most frequent last names
plt.figure(figsize=(10, 6))
authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

[Figure: horizontal bar chart of the 10 most frequent author last names (x-axis: Count, y-axis: Author)]
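The last-name count can also be written entirely with Pandas operations. The sketch below assumes a column of nested lists like `authors_parsed`; the rows are invented for illustration. `Series.explode` turns each paper's author list into one row per author, and `.str[0]` picks the first element of each author's list.

```python
import pandas as pd

# Made-up rows; each author is a list of name parts with the last name first.
df = pd.DataFrame({
    "authors_parsed": [
        [["Doe", "Jane", ""], ["Poe", "Ann", ""]],
        [["Doe", "John", ""]],
        [["Roe", "Rick", "Jr"]],
    ]
})

# One row per author, then the first element (the last name) of each.
last_names = df["authors_parsed"].explode().str[0]

# Frequency of each last name, most common first.
counts = last_names.value_counts()
print(counts.head(10))
```

Calling `counts.head(10).plot(kind='barh')` on the result would give the same kind of chart as above.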
