[Academic Trends] Task 02: Paper Author Statistics

Task background

  1. Count paper authors and pick the TOP 10 most frequent names
  2. Main difficulty: a single paper may have multiple authors

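Data processing steps

Observing the data

In the raw arXiv dataset, the authors field of each paper is a single string in which the individual authors are separated by commas.

Processing steps

  1. Split the authors string on commas
  2. Remove unusual characters from each author name: in practice this may take several rounds of trial and inspection to find them all

Of course, the authors_parsed field in the raw dataset already contains the processed author information, so it can be used directly for the statistics that follow.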
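
For data that lacks a pre-parsed field, the two manual steps described above can be sketched roughly as follows. This is only an illustration: the helper name split_authors, the set of characters treated as "regular", and the handling of " and " as an extra separator are assumptions, not part of the original tutorial.

```python
import re

def split_authors(authors: str) -> list:
    """Split a raw authors string into cleaned individual names (hypothetical helper)."""
    # Assumption: " and " between names is treated as one more separator
    normalized = re.sub(r"\s+and\s+", ",", authors)
    names = []
    for part in normalized.split(","):
        # Assumption: keep word characters, whitespace, dots, hyphens, apostrophes
        cleaned = re.sub(r"[^\w\s.\-']", "", part)
        # Collapse runs of whitespace (including newlines) into single spaces
        cleaned = " ".join(cleaned.split())
        if cleaned:
            names.append(cleaned)
    return names

raw = "Alice Smith, Bob K. Jones\n and Carol Lee*"
print(split_authors(raw))
```

On real data the character whitelist usually has to be adjusted after inspecting the names that come out wrong.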

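String processing

Escape characters

Escape sequence       Description
\ (at end of line)    Line continuation
\\                    Backslash
\'                    Single quote
\"                    Double quote
\n                    Newline
\t                    Horizontal tab
\r                    Carriage return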
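
A short sketch of how these escape sequences behave in Python string literals. The variable names and sample strings are made up for illustration; the raw-string line shows why the file path later in this post is written with an r prefix.

```python
backslash = "a\\b"             # \\ produces a single backslash
single = 'it\'s'               # \' embeds a single quote in a single-quoted string
double = "say \"hi\""          # \" embeds a double quote in a double-quoted string
multi = "line1\nline2"         # \n starts a new line
tabbed = "col1\tcol2"          # \t inserts a horizontal tab
continued = "first part, " \
    "second part"              # a trailing \ continues the statement on the next line
raw_path = r"E:\data\file.json"  # raw string: backslashes are kept literally

print(backslash, single, double, continued, raw_path)
print(multi)
print(tabbed)
```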

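Built-in methods (for string processing)

Method                 Description
string.capitalize()    Returns the string with its first character uppercased and the rest lowercased
string.isalpha()       Returns True if the string has at least one character and all characters are letters, otherwise False
string.title()         Returns a "title-cased" string: every word starts with an uppercase letter and the remaining letters are lowercase (see istitle())
string.upper()         Converts the lowercase letters in the string to uppercase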
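
The four methods in the table can be tried on a few sample strings. The strings below are invented for illustration only.

```python
capitalized = "arxiv paper".capitalize()   # first character upper, rest lower
letters_only = "Smith".isalpha()           # all characters are letters
with_quote = "O'Neil".isalpha()            # the apostrophe is not a letter
empty = "".isalpha()                       # an empty string has no characters
titled = "john von neumann".title()        # each word starts with a capital
uppered = "cs.cv".upper()                  # non-letters such as '.' are unchanged

print(capitalized, letters_only, with_quote, empty, titled, uppered)
```

isalpha() is a convenient first check when looking for author names that contain unusual characters.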

Code implementation and explanation

Reading the data

import pandas as pd
import json

data = []
with open(r'E:\学习相关文档\datawhale\202101\arxiv-metadata-oai-2019.json\arxiv-metadata-oai-2019.json','r') as f:
    for idx,line in enumerate(f):
        d = json.loads(line)
        # d is a dict; keep only the fields needed for this task
        d = {'authors':d['authors'],'categories':d['categories'],'authors_parsed':d['authors_parsed']}
        # data is a list of dicts
        data.append(d)
# Convert the list of dicts into a DataFrame
data = pd.DataFrame(data)

Note:
Compared with a plain for loop, enumerate() also returns the index of each element alongside the element itself.
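
One practical use of that index is to stop after a fixed number of lines while experimenting. The sketch below is an assumption-laden illustration: it uses an in-memory io.StringIO object with three made-up JSON lines in place of the real file, and the limit of 2 is arbitrary.

```python
import io
import json

# Hypothetical stand-in for the real JSON-lines file
fake_file = io.StringIO(
    '{"authors": "A"}\n'
    '{"authors": "B"}\n'
    '{"authors": "C"}\n'
)

records = []
for idx, line in enumerate(fake_file):
    if idx >= 2:  # idx starts at 0, so this keeps only the first two lines
        break
    records.append((idx, json.loads(line)["authors"]))

print(records)
```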

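Statistics

Goals

  1. Count the TOP 10 most frequent full names
  2. Count the TOP 10 most frequent last names
  3. Count the frequency of the first character of the last name

Implementation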

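import matplotlib.pyplot as plt

# Select the papers whose categories include cs.CV
data2 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]
# Alternative: filter by substring match (regex=False so that '.' is matched literally)
# data2 = data[data['categories'].str.contains('cs.CV', regex=False)]

# Flatten the Series of per-paper author lists into one list of authors
all_authors = sum(data2['authors_parsed'],[])
# Join the parts of each author entry into a single name string
authors_names = [' '.join(x).strip() for x in all_authors]
authors_names = pd.DataFrame(authors_names)

# Draw a horizontal bar chart of author frequency
plt.figure(figsize=(10, 6))
authors_names[0].value_counts().head(10).plot(kind='barh')

# Adjust the chart settings
names = authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')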
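
The code above covers only the first goal (full-name frequency). Goals 2 and 3 could be approached along the following lines. This is a sketch on made-up data: the small authors_parsed Series stands in for data2['authors_parsed'], and it assumes each author entry has the form [last name, first name, suffix].

```python
import pandas as pd

# Hypothetical stand-in for data2['authors_parsed']
authors_parsed = pd.Series([
    [["Zhang", "Wei", ""], ["Li", "Na", ""]],
    [["Zhang", "Lei", ""], ["Smith", "John", ""]],
    [["Li", "Ming", ""]],
])

# Flatten the per-paper lists into one list of author entries
all_authors = [author for paper in authors_parsed for author in paper]

# Goal 2: frequency of last names (assumed to be the first element of each entry)
last_names = pd.Series([author[0] for author in all_authors])
top_last_names = last_names.value_counts().head(10)

# Goal 3: frequency of the first character of each last name
first_chars = last_names.str[0].value_counts()

print(top_last_names)
print(first_chars)
```

The list comprehension is used for flattening instead of sum(..., []) because it avoids repeatedly copying the growing list, which matters on the full dataset.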

Topics for further reading:

  1. Usage of apply
  2. Usage of join
  3. Usage of value_counts()
  4. Usage of plot()
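
As a rough illustration of the first three items, the sketch below applies them to a tiny made-up DataFrame. The column names mirror the ones used earlier in this post, but the rows are invented, and splitting categories on whitespace is an assumption about how the field is formatted.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the real DataFrame
toy = pd.DataFrame({
    "categories": ["cs.CV cs.LG", "math.CO", "cs.CV"],
    "authors_parsed": [
        [["Zhang", "Wei", ""]],
        [["Smith", "John", ""]],
        [["Zhang", "Wei", ""], ["Li", "Na", ""]],
    ],
})

# apply: run a function on every element of a column, producing a boolean mask
is_cv = toy["categories"].apply(lambda cats: "cs.CV" in cats.split())
cv_papers = toy[is_cv]

# join: glue the non-empty parts of each author entry together with a space
names = [
    " ".join(part for part in author if part)
    for paper in cv_papers["authors_parsed"]
    for author in paper
]

# value_counts: count how often each distinct value occurs, most frequent first
counts = pd.Series(names).value_counts()
print(counts)
```

Calling counts.head(10).plot(kind='barh') on the result would then give a horizontal bar chart like the one produced in the implementation section.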