Data Analysis Study: Academic Frontier Trend Analysis, Task 2 (Paper Author Statistics)

Task 2: Paper Author Statistics

For details on the dataset, see [Datawhale Data Analysis Study: Academic Frontier Trend Analysis, Task 1](https://blog.csdn.net/weixin_37700945/article/details/112550261).

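2.1 Task Overview

  • Task topic: paper author statistics, finding the Top 10 most frequent author names across all papers;
  • Task content: count paper authors, reading the data with Pandas and applying string operations;
  • Task outcome: learn Pandas string operations;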

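2.2 Data Processing Steps

  1. Read the data

    Read three fields: authors (author string), categories (paper categories), and authors_parsed (parsed author information);

  2. Compute the statistics

  • Top 10 most frequent full author names;
  • Top 10 most frequent author surnames (the last word of a name as written, stored as the first element of each authors_parsed entry);
  • Frequency of the first character of each author surname;

Note: my computer struggles with the full dataset, so to keep the exercise fast the statistics are computed only on papers whose "categories" field contains "cs.CL" (Computation and Language).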
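Since memory is the constraint mentioned in the note above, one option is to filter by category while reading, so that only cs.CL records are ever kept. The following is a minimal sketch, assuming each line of the file is one JSON object and that `categories` is a whitespace-separated string of tags. The helper `iter_category` and the in-memory sample are hypothetical stand-ins for the real file.

```python
import io
import json

FIELDS = ("authors", "categories", "authors_parsed")

def iter_category(lines, category="cs.CL"):
    """Yield the selected fields of every record tagged with `category`."""
    for line in lines:
        record = json.loads(line)
        # Exact match on whitespace-separated tags (assumed format).
        if category in record["categories"].split():
            yield {key: record[key] for key in FIELDS}

# Hypothetical two-record sample standing in for the real JSON-lines file.
sample = io.StringIO(
    '{"authors": "A. Author", "categories": "cs.CL cs.AI", '
    '"authors_parsed": [["Author", "A.", ""]], "title": "x"}\n'
    '{"authors": "B. Writer", "categories": "astro-ph", '
    '"authors_parsed": [["Writer", "B.", ""]], "title": "y"}\n'
)
records = list(iter_category(sample))
print(records)
```

The resulting list can be passed to `pd.DataFrame(records)` exactly like the `data` list in the reading code below.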

2.3 Code Implementation

2.3.1 Reading the Data

# Import the required packages
import json  # parse the data, which is stored as one JSON object per line
import os

import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting

# Raw string so the backslashes in the Windows path are not treated as escapes
os.chdir(r"D:\数据分析\Datawhale项目")

data = []
with open("arxiv-metadata-oai-2019.json", 'r', encoding='utf-8') as f:
    for line in f:
        d = json.loads(line)
        # Keep only the three fields needed for the author statistics
        d = {'authors': d['authors'], 'categories': d['categories'], 'authors_parsed': d['authors_parsed']}
        data.append(d)

data = pd.DataFrame(data)
data.head()
   authors                                            categories  authors_parsed
0  Sung-Chul Yoon, Philipp Podsiadlowski and Step...  astro-ph    [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1  B. Dugmore and PP. Ntumba                          math.AT     [[Dugmore, B., ], [Ntumba, PP., ]]
2  T.V. Zaqarashvili and K Murawski                   astro-ph    [[Zaqarashvili, T. V., ], [Murawski, K, ]]
3  Sezgin Aygun, Ismail Tarhan, Husnu Baysal          gr-qc       [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4  Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...  astro-ph    [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...

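2.3.2 Computing the Statistics

The authors_parsed field in the original dataset already holds the parsed author information, so the remaining statistics can use it directly. All author names are collected into a single list in which each element is one author's name.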
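Collecting the per-paper author lists into one flat list can be sketched as follows. This version uses `itertools.chain.from_iterable`, which avoids the repeated list copying of `sum(lists, [])` on large inputs. The toy `authors_parsed` values are made up for illustration.

```python
from itertools import chain

# Hypothetical authors_parsed values for three papers: each paper holds a
# list of [surname, given names, suffix] entries.
authors_parsed = [
    [["Liu", "Ting", ""], ["Zhao", "Hai", ""]],
    [["Neubig", "Graham", ""]],
    [["Liu", "Ting", ""]],
]

# One flat list with one entry per (paper, author) pair.
all_authors = list(chain.from_iterable(authors_parsed))

# Join the parts of each entry into a display name, trimming the trailing
# space left by an empty suffix.
author_names = [" ".join(parts).strip() for parts in all_authors]
print(author_names)
```

An author who appears on several papers appears several times in the flat list, which is what the frequency counts rely on.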

First, compute the frequency of full names.

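# Keep only papers whose categories include cs.CL
data_p = data[data['categories'].apply(lambda x: 'cs.CL' in x)]
# Concatenate the per-paper author lists into one flat list
all_authors = sum(data_p['authors_parsed'], [])
# Join the parts of each author entry into a single name string
authors_names = [' '.join(x) for x in all_authors]
authors_names = pd.DataFrame(authors_names)

print(authors_names)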
                                                       0
0              Mokhov Serguei A.  for the MARF R&D Group
1               Sinclair Stephen  for the MARF R&D Group
2                    Clément Ian  for the MARF R&D Group
3       Nicolacopoulos Dimitrios  for the MARF R&D Group
4                                 Ferrer-i-Cancho Ramon 
...                                                  ...
20480  Rodriguez Horacio  Universitat Politecnica de ...
20481                                    Goerz Guenther 
20482                                     Spilker Joerg 
20483                                      Strom Volker 
20484                                        Weber Hans 

[20485 rows x 1 columns]
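The printed names show trailing spaces and, for some entries, group or affiliation text in the third field, so the same person could be counted under more than one spelling. One possible cleanup is to keep only the first two fields and normalize whitespace with pandas string methods. This is a sketch that assumes each entry follows a [surname, given names, suffix] layout, which is inferred from the output above rather than documented here. The sample entries are illustrative.

```python
import pandas as pd

# Hypothetical entries in the assumed [surname, given names, suffix] layout;
# the third field sometimes carries group or affiliation text.
all_authors = [
    ["Mokhov", "Serguei A.", "for the MARF R&D Group"],
    ["Ferrer-i-Cancho", "Ramon", ""],
    ["Mokhov", "Serguei A.", ""],
]

names = pd.Series([" ".join(entry[:2]) for entry in all_authors])
# Collapse repeated whitespace and trim both ends.
names = names.str.replace(r"\s+", " ", regex=True).str.strip()
name_counts = names.value_counts()
print(name_counts)
```

Whether dropping the third field is appropriate depends on the data; a genuine suffix such as "Jr." would be lost as well.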

# Plot a horizontal bar chart of the most frequent author names
plt.figure(figsize = (10,6))
authors_names[0].value_counts().head(10).plot(kind = 'barh')
print(authors_names[0].value_counts().head(10))

# Adjust the plot settings
names = authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
Nakov Preslav         48
Neubig Graham         36
Liu Ting              31
Gao Jianfeng          31
Zhao Hai              28
Gurevych Iryna        27
Liu Zhiyuan           25
Yu Dong               23
Wang William Yang     23
Watanabe Shinji       22
Name: 0, dtype: int64


Text(0.5, 0, 'Count')

Top 10 most frequent author names in cs.CL (Computation and Language) papers, 2019


Next, count how often each author surname occurs (the surname is the first element of each author entry in authors_parsed):

authors_lastnames = [x[0] for x in all_authors]  # the first element of each entry is the surname
authors_lastnames = pd.DataFrame(authors_lastnames)
print(authors_lastnames[0].value_counts().head(10))

plt.figure(figsize=(10, 6))
authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
print(sum(authors_lastnames[0].value_counts().head(10).tolist()))  # total occurrences of the Top 10 surnames
Wang     529
Zhang    485
Li       427
Liu      426
Chen     337
Huang    191
Wu       181
Zhou     172
Yang     161
Xu       158
Name: 0, dtype: int64

Text(0.5, 0, 'Count')

Top 10 most frequent author surnames in cs.CL papers, 2019

3067

Finally, count how often each first character of an author surname occurs:

authors_lastnames_FL = [i[0] for i in [y[0] for y in all_authors]]  # nested comprehension: take each surname, then its first character
authors_lastnames_FL = pd.DataFrame(authors_lastnames_FL)
print(authors_lastnames_FL[0].value_counts().head(10))

plt.figure(figsize=(10, 6))
authors_lastnames_FL[0].value_counts().head(10).plot(kind='barh')

lastnames_FL = authors_lastnames_FL[0].value_counts().index.values[:10]
_lastnames_FL = plt.yticks(range(0, len(lastnames_FL)), lastnames_FL)
plt.ylabel('Author')
plt.xlabel('Count')
S    1934
L    1919
C    1444
M    1245
Z    1193
W    1174
H    1115
B    1085
K    1083
G    1046
Name: 0, dtype: int64

Text(0.5, 0, 'Count')

Top 10 most frequent first letters of author surnames in cs.CL papers, 2019
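The first-letter count can also be written with pandas string methods, which is the stated goal of this task. This is a small sketch using the `.str` accessor. The surname values are hypothetical stand-ins for the real surname column, and the upper-casing step is an added normalization that the code above does not perform.

```python
import pandas as pd

# Hypothetical surnames standing in for authors_lastnames[0].
surnames = pd.Series(["Wang", "Zhang", "wu", "Li", "Liu", "", "Zhou"])

# .str[0] takes the first character; an empty string yields a missing value
# instead of raising IndexError, and value_counts() skips missing values.
initials = surnames.str[0].str.upper()
initial_counts = initials.value_counts()
print(initial_counts)
```

Compared with indexing each string in a list comprehension, this form tolerates empty surnames, at the cost of silently dropping them from the counts.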

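2.3.3 Extension Exercise: Adding Labels to the Bar Chart

Matplotlib's text() function can be used for this.

matplotlib.pyplot.text(x, y, s, fontdict=None, **kwargs)

Parameters

  1. x, y : scalars. The position at which to place the text.
  2. s : str. The text content.
  3. fontdict : dictionary, optional, default: None. A dict that overrides the default text properties of s.
  4. withdash : boolean, optional, default: False. Accepted only by older Matplotlib releases, where True created a TextWithDash instance; the parameter has since been removed.

Other commonly used parameters:

  1. fontsize: font size, given as a number in points or as one of ['xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large']; the default comes from rcParams['font.size']
  2. fontweight: font weight, with options including ['light', 'normal', 'medium', 'semibold', 'bold', 'heavy', 'black']
  3. fontstyle: font style, one of ['normal', 'italic', 'oblique']; italic selects the italic face, oblique a slanted one
  4. verticalalignment: vertical alignment, with options 'center', 'top', 'bottom', 'baseline'
  5. horizontalalignment: horizontal alignment, with options 'left', 'right', 'center'
  6. rotation: rotation angle, either 'vertical', 'horizontal', or a number of degrees
  7. alpha: transparency, a value between 0 and 1
  8. backgroundcolor: background color of the text
  9. bbox: draws a box around the text; commonly used settings:
    1. boxstyle: shape of the box
    2. facecolor (short form fc): fill color
    3. edgecolor (short form ec): border color
    4. linewidth (short form lw): border line width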
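To make the parameters above concrete, here is a minimal sketch that places one boxed label on an empty figure. It calls `Axes.text`, the Axes-level method that `plt.text` forwards to. The label content, colors, and sizes are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
label = ax.text(
    0.5, 0.5, "cs.CL",
    fontsize="large",
    fontweight="bold",
    horizontalalignment="center",  # left/right placement relative to x
    verticalalignment="center",    # up/down placement relative to y
    rotation=15,                   # degrees, counter-clockwise
    alpha=0.9,
    bbox=dict(boxstyle="round", facecolor="lightyellow",
              edgecolor="gray", linewidth=1.5),
)
fig.canvas.draw()  # render once so the box is laid out
```

The `bbox` dict is passed on to the patch drawn behind the text, which is why it takes patch properties such as `facecolor` and `linewidth`.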

Worked Example

Using the Top 10 most frequent paper authors as an example:

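# Plot a horizontal bar chart of the most frequent author names
plt.figure(figsize = (10,6))
name_top10 = authors_names[0].value_counts().head(10)
name_top10.plot(kind = 'barh')

# Adjust the plot settings
names = name_top10.index.values
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

counts = name_top10.tolist()  # x coordinate of each label: the bar length
axis_y = list(range(len(counts)))  # y coordinate of each label: the bar position
# Add a count label just past the end of each bar
for x, y in zip(counts, axis_y):
    plt.text(x + 0.5, y, str(x), verticalalignment='center')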
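As an alternative to positioning each label by hand, Matplotlib 3.4 and newer provide `Axes.bar_label`, which annotates every bar in a container. This is a sketch under that version assumption, using the top three counts from the name-frequency output in 2.3.2 as sample data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Top three names and counts from the name-frequency output in 2.3.2.
names = ["Nakov Preslav", "Neubig Graham", "Liu Ting"]
counts = [48, 36, 31]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(names, counts)
ax.invert_yaxis()  # put the largest count at the top
# bar_label places one annotation per bar (requires Matplotlib >= 3.4).
labels = ax.bar_label(bars, padding=3)
ax.set_xlabel("Count")
ax.set_ylabel("Author")
```

The manual `plt.text` loop remains useful on older Matplotlib versions or when each label needs its own offset or styling.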


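2.4 Conclusions

The analysis and bar charts above show the following about author frequency in cs.CL:

  • Most of the Top 10 authors have East Asian names: 6 Chinese names and 1 Japanese name;
  • Only one of the top three has a Chinese name (Liu Ting, in third place);
  • The most frequent author is Nakov Preslav, with 48 occurrences;
  • All Top 10 surnames are Chinese surnames, making them the most common group; together the ten surnames occur 3067 times;
  • Among surname first letters, S is the most common, with 1934 occurrences.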
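To put these totals in context, they can be related to the 20485 author entries in the cs.CL sample (the row count printed in 2.3.2). The following is a small arithmetic sketch that uses only figures from the outputs above; the variable names are illustrative.

```python
# Figures taken from the outputs shown earlier in this post.
total_author_entries = 20485   # rows in authors_names
top10_surname_total = 3067     # sum of the Top 10 surname counts
top_author_count = 48          # occurrences of Nakov Preslav

surname_share = top10_surname_total / total_author_entries
author_share = top_author_count / total_author_entries
print(f"Top 10 surnames: {surname_share:.1%} of author entries")
print(f"Most frequent author: {author_share:.2%} of author entries")
```

On these numbers the ten most common surnames cover roughly 15% of all author entries, while the single most frequent author accounts for well under 1%.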