[Academic Frontier Trend Analysis]

Latest recommended article published 2022-01-21 13:20:14

Chaossll


Column: Data Analysis. Tags: python, data analysis

Article link: https://blog.csdn.net/weixin_44493291/article/details/112573603


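This post works through several analysis tasks on a dataset of computer science papers: counting 2019 papers per subfield, author name frequencies, and code-related statistics, followed by paper category classification and author relationship linking. It uses web scraping, Pandas, and regular expressions, and touches on text classification and graph algorithms.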


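Academic Frontier Trend Analysis

Task 01: Paper Count Statistics

Task description: count the number of papers in each computer science subfield for the whole of 2019.

Dataset preparation
Dataset source: dataset download link
Dataset description
Field descriptions: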

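id: the arXiv ID, which can be used to access the paper;
submitter: the person who submitted the paper;
authors: the paper's authors;
title: the paper's title;
comments: page count, number of figures, and other information;
journal-ref: information about the journal in which the paper was published;
doi: digital object identifier, https://www.doi.org;
report-no: report number;
categories: the categories or tags of the paper in the arXiv system;
license: the paper's license;
abstract: the paper's abstract;
versions: the paper's versions;
authors_parsed: parsed author information.

arXiv paper categories
Category information link
A partial list of categories is shown below: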

'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',

Code implementation and analysis
requirements

seaborn: 0.9.0
BeautifulSoup: 4.8.0
requests: 2.22.0
json: 0.8.5
pandas: 0.25.1
matplotlib: 3.1.1

Imports

# Import the required packages
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt

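Read the data and print its size and a few sample rows

# Read the data
filepath = r'D:\Datasets\archive\arxiv-metadata-oai-snapshot.json'
data = []  # Initialize
# Advantages of the with statement: 1. the file handle is closed automatically; 2. it is closed even if reading raises an exception
with open(filepath, 'r') as f:
    for line in f:
        data.append(json.loads(line))

data = pd.DataFrame(data)  # Convert the list to a DataFrame for analysis with pandas
print(data.shape)  # Show the size of the data
print(data.head())  # Show the first five rows
print(data["categories"].describe())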

The output here is:

(1796911, 14)

 id  ...                                     authors_parsed
0  0704.0001  ...  [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
1  0704.0002  ...           [[Streinu, Ileana, ], [Theran, Louis, ]]
2  0704.0003  ...                                 [[Pan, Hongjun, ]]
3  0704.0004  ...                                [[Callan, David, ]]
4  0704.0005  ...  [[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]

[5 rows x 14 columns]
count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object
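
Loading all of the roughly 1.8 million records as full dicts before building the DataFrame can use a lot of memory. A minimal sketch, assuming only a few fields are needed: the helper `iter_records` (a hypothetical name, not part of the original code) parses the file line by line and keeps only the requested keys. The sample rows are made up.

```python
import io
import json
from itertools import islice

def iter_records(lines, fields, limit=None):
    """Yield one dict per JSON line, keeping only the requested fields.

    `lines` is any iterable of JSON strings (an open file works).
    `limit` optionally caps the number of records, which is handy for quick tests.
    """
    for line in islice(lines, limit):
        record = json.loads(line)
        yield {key: record.get(key) for key in fields}

# Tiny in-memory stand-in for the real snapshot file (made-up sample rows).
sample = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph", "update_date": "2008-11-26"}\n'
    '{"id": "0704.0002", "categories": "math.CO cs.CG", "update_date": "2008-12-13"}\n'
)
records = list(iter_records(sample, fields=["id", "categories"]))
print(records)
```

With the real file, `pd.DataFrame(list(iter_records(f, [...])))` inside the `with open(...)` block would build the same kind of DataFrame with fewer columns.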

Extract the unique categories

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
print(len(unique_categories))
print(unique_categories)

Output

176

{'q-fin.EC', 'math.SG', 'hep-ph', 'q-bio.QM', 'nucl-th', 'q-bio.MN', 'q-fin.GN', 'cs.RO', 'math.RA', 'gr-qc', 'cs.DM', 'cs.NI', 'astro-ph.SR', 'physics.flu-dyn', 'math.DS', 'math.GT', 'stat.AP', 'hep-lat', 'solv-int', 'econ.TH', 'astro-ph', 'math.GR', 'math.CA', 'q-fin.RM', 'math.MP', 'cs.GT', 'math.OC', 'math.AG', 'math.LO', 'math.NT', 'physics.ao-ph', 'acc-phys', 'cs.CL', 'eess.AS', 'math.AT', 'alg-geom', 'hep-th', 'cs.DC', 'cs.MM', 'cs.IR', 'cs.GL', 'cs.NA', 'q-fin.CP', 'cs.PL', 'physics.med-ph', 'math.KT', 'stat.ML', 'physics.optics', 'astro-ph.HE', 'physics.bio-ph', 'cs.GR', 'stat.ME', 'physics.acc-ph', 'physics.space-ph', 'physics.data-an', 'cs.OS', 'cs.MA', 'physics.ins-det', 'physics.gen-ph', 'astro-ph.GA', 'math.AC', 'q-fin.MF', 'cs.CE', 'physics.plasm-ph', 'cs.ET', 'cs.HC', 'eess.SY', 'mtrl-th', 'quant-ph', 'funct-an', 'econ.GN', 'stat.TH', 'cs.MS', 'physics.class-ph', 'cs.CR', 'q-bio.NC', 'cond-mat.supr-con', 'cs.CY', 'cond-mat.mes-hall', 'stat.CO', 'plasm-ph', 'q-bio.CB', 'nucl-ex', 'q-bio.BM', 'cond-mat.quant-gas', 'cs.SE', 'q-bio', 'comp-gas', 'cs.AI', 'physics.chem-ph', 'physics.atm-clus', 'math.HO', 'patt-sol', 'cond-mat', 'supr-con', 'q-bio.SC', 'cs.CG', 'cond-mat.mtrl-sci', 'math.SP', 'math.NA', 'physics.hist-ph', 'math.GN', 'nlin.CD', 'cs.NE', 'astro-ph.CO', 'adap-org', 'math.FA', 'physics.ed-ph', 'math-ph', 'cs.CV', 'stat.OT', 'physics.soc-ph', 'math.ST', 'eess.IV', 'math.DG', 'math.PR', 'math.QA', 'cs.LO', 'chao-dyn', 'cs.DS', 'nlin.CG', 'math.MG', 'cs.DB', 'cs.SD', 'astro-ph.IM', 'physics.app-ph', 'cond-mat.stat-mech', 'q-fin.ST', 'cs.LG', 'dg-ga', 'cond-mat.other', 'cs.IT', 'hep-ex', 'cond-mat.dis-nn', 'math.OA', 'q-fin.TR', 'cond-mat.soft', 'math.GM', 'cmp-lg', 'math.IT', 'eess.SP', 'math.AP', 'math.CO', 'cs.CC', 'chem-ph', 'math.CV', 'math.RT', 'q-alg', 'cond-mat.str-el', 'q-fin.PM', 'cs.SC', 'q-bio.TO', 'bayes-an', 'q-bio.GN', 'physics.pop-ph', 'cs.AR', 'nlin.SI', 'physics.atom-ph', 'cs.FL', 'cs.SI', 'q-bio.PE', 'cs.SY', 'physics.comp-ph', 
'ao-sci', 'q-fin.PR', 'q-bio.OT', 'cs.PF', 'cs.OH', 'cs.DL', 'atom-ph', 'astro-ph.EP', 'nlin.PS', 'math.CT', 'physics.geo-ph', 'econ.EM', 'nlin.AO'}
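
The set above says which tags exist but not how often each occurs. A small sketch, assuming the same space-separated format as the `categories` column; `count_category_tags` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter

def count_category_tags(category_strings):
    """Count how often each tag appears across space-separated category strings."""
    counts = Counter()
    for cats in category_strings:
        counts.update(cats.split())
    return counts

# Made-up sample values in the same format as the `categories` column.
sample = ["hep-ph", "math.CO cs.CG", "cs.CG cs.AI", "hep-ph"]
tag_counts = count_category_tags(sample)
print(len(tag_counts))             # number of unique tags
print(tag_counts.most_common(2))   # most frequent tags first
```

On the real data, `count_category_tags(data["categories"])` should give the same number of keys as `len(unique_categories)`.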

Filter papers from 2019 onward

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # Convert update_date from a str such as 2019-02-20 to datetime and extract the year
del data["update_date"]  # Drop the update_date column, which is no longer needed
data = data[data["year"] >= 2019]  # Keep only rows from 2019 onward and discard the rest
# data.groupby(['categories','year'])  # Group by categories, then by year within the same categories
data.reset_index(drop=True, inplace=True)  # Renumber the index
print(data)  # Inspect the result
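
The task statement asks for 2019, while the filter above keeps 2019 and every later year. A minimal sketch on made-up rows shows the difference between the two conditions; it assumes `update_date` strings in `YYYY-MM-DD` form, as in the dataset.

```python
import pandas as pd

# Made-up rows using the same `update_date` string format as the dataset.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "update_date": ["2018-12-31", "2019-02-20", "2019-11-05", "2020-01-15"],
})
toy["year"] = pd.to_datetime(toy["update_date"]).dt.year

from_2019_on = toy[toy["year"] >= 2019].reset_index(drop=True)  # what the code above does
only_2019 = toy[toy["year"] == 2019].reset_index(drop=True)     # the year 2019 alone

print(from_2019_on["id"].tolist())
print(only_2019["id"].tolist())
```

Which condition is appropriate depends on whether later years should be included in the counts.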

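The reference document also uses a web scraper to fetch the category taxonomy from the arXiv website and joins it with the local dataset, as a cross-check on the completeness of the data.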

# Scrape all categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # Get the page's text
soup = BeautifulSoup(website_url, 'lxml')  # Parse the page; the lxml parser is used for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # Find the entry tag for the taxonomy list
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # Read the tags

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk through the tags and fill the lists
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # Regex: pattern (.*)\((.*)\); replacement "\2"; input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the information collected above
df_taxonomy = pd.DataFrame({
    'group_name': level_1_names,
    'archive_name': level_2_names,
    'archive_id': level_2_codes,
    'category_name': level_3_names,
    'categories': level_3_codes,
    'category_description': level_3_notes

})

# Group by "group_name" and "archive_name"; the GroupBy result is not assigned, so df_taxonomy itself is unchanged
df_taxonomy.groupby(["group_name", "archive_name"])
print(df_taxonomy)

re.sub(r"(.*)\((.*)\)",r"\2",raw)

#raw = Astrophysics(astro-ph)
#output = astro-ph
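
The substitution above keeps only the second capture group. A short sketch of the same pattern used to pull out both parts at once; `split_archive_heading` is a hypothetical helper, not part of the original scraper.

```python
import re

# Same pattern as in the scraper: group 1 is the name, group 2 the code in parentheses.
ARCHIVE_PATTERN = re.compile(r"(.*)\((.*)\)")

def split_archive_heading(raw):
    """Split a heading such as 'Astrophysics(astro-ph)' into (name, code)."""
    match = ARCHIVE_PATTERN.fullmatch(raw.strip())
    if match is None:
        return raw.strip(), None  # heading without a parenthesised code
    name, code = match.groups()
    return name.strip(), code.strip()

print(split_archive_heading("Astrophysics(astro-ph)"))
print(re.sub(r"(.*)\((.*)\)", r"\2", "Astrophysics(astro-ph)"))
```

Returning `None` for headings without parentheses makes the unmatched case explicit, whereas `re.sub` silently returns the input unchanged.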

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

print(_df)
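
The merge above joins on the whole `categories` string, so a paper tagged with several categories (for example "math.CO cs.CG") finds no matching taxonomy row and drops out of the group counts. A sketch of one possible alternative, not the author's method: split the string and explode it so each tag is matched separately. The two small frames are made-up stand-ins for `data` and `df_taxonomy`.

```python
import pandas as pd

# Made-up papers and a four-row stand-in for df_taxonomy.
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.CV", "math.CO cs.CG", "cs.AI cs.CV"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.CV", "cs.CG", "cs.AI", "math.CO"],
    "group_name": ["Computer Science", "Computer Science",
                   "Computer Science", "Mathematics"],
})

# One row per (paper, single category) so that every tag can match the taxonomy.
exploded = papers.assign(categories=papers["categories"].str.split(" ")).explode("categories")

group_counts = (
    exploded.merge(taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])   # count a paper once per group
    .groupby("group_name")["id"].count()
    .sort_values(ascending=False)
)
print(group_counts)
```

With this variant a cross-listed paper is counted once in every group it belongs to, so the group totals can add up to more than the number of papers.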

Finally, plot the data for analysis

fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats_by_year = cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")
print(cats_by_year)

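Pie chart

Task 02: Paper Author Statistics

Task description:

Task topic: paper author statistics; find the Top 10 most frequent author names across all papers;
Task content: author statistics, reading the data with Pandas and using string operations;
Task outcome: learn Pandas string operations;

Reading the data: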

data = []
with open("arxiv-metadata-oai-snapshot.json", 'r') as f: 
    for idx, line in enumerate(f): 
        d = json.loads(line)
        d = {'authors': d['authors'], 'categories': d['categories'], 'authors_parsed': d['authors_parsed']}
        data.append(d)
        
data = pd.DataFrame(data)

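Statistics:

Count the Top 10 most frequent full author names;
Count the Top 10 most frequent author surnames (the last word of the name);
Count the frequency of the first character of author surnames;

# Select papers in the cs.CV category
data2 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]

# Concatenate all authors
all_authors = sum(data2['authors_parsed'], [])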

After this step, all_authors is a single flat list in which each element is one author's name. We first compute the name frequencies.
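
Flattening with `sum(lists, [])` builds a new, longer list at every step, which gets slow on many papers. A sketch of an equivalent flattening with `itertools.chain`, shown on made-up entries in the `authors_parsed` shape:

```python
from itertools import chain

# Made-up entries in the same shape as the `authors_parsed` column:
# one list per paper, each author as [last name, first name, suffix].
authors_parsed = [
    [["Streinu", "Ileana", ""], ["Theran", "Louis", ""]],
    [["Pan", "Hongjun", ""]],
]

# chain.from_iterable walks the per-paper lists once and yields each author entry.
all_authors = list(chain.from_iterable(authors_parsed))
print(all_authors)
```

On the real data this would read `list(chain.from_iterable(data2['authors_parsed']))`.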

# Join each author's name parts into one string
authors_names = [' '.join(x) for x in all_authors]
authors_names = pd.DataFrame(authors_names)

# Plot a bar chart of author frequencies
plt.figure(figsize=(10, 6))
authors_names[0].value_counts().head(10).plot(kind='barh')

# Adjust the plot settings
names = authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
plt.show()

Note that plt.show() must be called here to display the figure.

Next, count the surnames, that is, the first element of each author entry in the authors_parsed field:

authors_lastnames = [x[0] for x in all_authors]
authors_lastnames = pd.DataFrame(authors_lastnames)

plt.figure(figsize=(10, 6))
authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
plt.show()
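
The third statistic in the task list, the frequency of the first character of each surname, has no code in this post. A minimal sketch, assuming the surname is element 0 of each author entry as above; `first_letter_counts` is a hypothetical helper and the sample entries are made up.

```python
from collections import Counter

def first_letter_counts(all_authors):
    """Count the first character of each surname (element 0 of an author entry)."""
    return Counter(
        author[0][0].upper()
        for author in all_authors
        if author and author[0]  # skip entries with an empty surname
    )

# Made-up author entries in the authors_parsed format.
sample_authors = [["Streinu", "Ileana", ""], ["Theran", "Louis", ""],
                  ["Pan", "Hongjun", ""], ["torchinsky", "Alberto", ""]]
counts = first_letter_counts(sample_authors)
print(counts.most_common())
```

Upper-casing the character merges "t" and "T"; drop `.upper()` if case should be kept distinct.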


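Task 03: Paper Code Statistics

Task description

Task topic: paper code statistics; compute statistics about code mentioned in papers;
Task content: use regular expressions to count code links, page counts, and figure counts;
Task outcome: learn to compute statistics with regular expressions;

Data processing steps
In the original arXiv dataset, authors often give a link to their code in the comments or abstract field, so we need to find the code links in these fields.

Determine where the data appears;
Match it with regular expressions;
Compute the relevant statistics;

Regular expression tutorial

Code implementation

Reading the data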

data = []  # Initialize
# Advantages of the with statement: 1. the file handle is closed automatically; 2. it is closed even if reading raises an exception
with open("arxiv-metadata-oai-snapshot.json", 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'abstract': d['abstract'], 'categories': d['categories'], 'comments': d['comments']}
        data.append(d)

data = pd.DataFrame(data)  # Convert the list to a DataFrame for analysis with pandas

# Extract the page counts:
# Match "XX pages" with a regular expression
data['pages'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* pages', str(x)))

# Keep only papers that have a page count
data = data[data['pages'].apply(len) > 0]

# The match is a list such as ['19 pages'], so it needs to be converted to a number
data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages', '')))

print(data['pages'].describe().astype(int))

Printed page statistics

count    1089180
mean          17
std           22
min            1
25%            8
50%           13
75%           22
max        11232
Name: pages, dtype: int32
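
To see what the page pattern does and does not match, here is a small sketch on made-up comment strings. `extract_pages` is a hypothetical helper that wraps the same regular expression and returns the first match as an integer.

```python
import re

PAGES_PATTERN = re.compile(r"[1-9][0-9]* pages")

def extract_pages(comment):
    """Return the first 'N pages' count in a comment as an int, or None."""
    matches = PAGES_PATTERN.findall(str(comment))
    if not matches:
        return None
    return int(matches[0].replace(" pages", ""))

# Made-up comment strings in the style of the `comments` field.
for comment in ["37 pages, 15 figures", "10 pages + 3 pages appendix", "5 figures", None]:
    print(repr(comment), "->", extract_pages(comment))
```

Only the first match is used, so a comment that lists main text and appendix separately contributes just the first number, and the singular "1 page" is not matched at all.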

Compute page counts per category and plot

# Keep only the main category
data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
data['categories'] = data['categories'].apply(lambda x: x.split('.')[0])

# Average number of pages per category
plt.figure(figsize=(12, 6))
data.groupby(['categories'])['pages'].mean().plot(kind='bar')
plt.show()


data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))
data = data[data['figures'].apply(len) > 0]
data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures', '')))

# Select papers that mention github
data_with_code = data[
    (data.comments.str.contains('github')==True)|(data.abstract.str.contains('github')==True)
].copy()  # Copy so that the column assignments below do not write to a view of data
data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')

# Match GitHub links with a regular expression
pattern = r'[a-zA-Z]+://github[^\s]*'
data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)

data_with_code = data_with_code[data_with_code['code_flag'] == 1]
plt.figure(figsize=(12, 6))
data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')
plt.show()

Extract the number of figures in each paper and the code links; only GitHub links are extracted here.
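
A small sketch of the link extraction on its own, using a letters-only scheme class and `\S*` for the rest of the URL. `find_github_links` is a hypothetical helper, and the repository URL in the sample text is made up.

```python
import re

# [a-zA-Z] matches letters only; \S* then consumes the rest of the URL up to whitespace.
GITHUB_URL = re.compile(r"[a-zA-Z]+://github\S*")

def find_github_links(text):
    """Return all GitHub URLs found in a piece of text."""
    return GITHUB_URL.findall(text or "")

# Made-up abstract text with a hypothetical repository URL.
sample = "Code is available at https://github.com/example/repo and data on request."
print(find_github_links(sample))
```

A URL that ends a sentence keeps its trailing punctuation, so some post-processing such as `.rstrip('.,)')` may be needed before using the links.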

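Task 04: Paper Category Classification

Task description

Learning topic: paper classification (a data modeling task); build a model on existing data and use it to classify new papers;
Learning content: classify papers by category using their titles;
Learning outcome: learn the basic methods of text classification, TF-IDF, and so on;

Data processing steps
Every paper in the original arXiv data has categories, and these are filled in by the authors. In this task we can use the paper's title and abstract to:

Process the titles and abstracts;
Process the categories;
Build a text classification model;

Text classification approaches

Approach 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF and classify with a classifier; possible choices include SVM, LR, and XGBoost.

Approach 2: FastText
FastText is an entry-level word embedding; with the FastText tool provided by Facebook, a classifier can be built quickly.

Approach 3: Word2Vec + deep learning classifier
Word2Vec is a more advanced word embedding, and classification is done by building a deep learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.

Approach 4: BERT embeddings
BERT is a high-end word embedding with strong modeling capacity.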
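
To make the feature extraction in approach 1 concrete, here is a small dependency-free sketch of the textbook TF-IDF weighting, tf(t, d) × log(N / df(t)). It is only illustrative: scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalises each row by default, so its numbers differ. The function name `tfidf` and the sample titles are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Made-up titles for illustration.
titles = ["deep learning for image segmentation",
          "quantum field theory on the lattice",
          "deep reinforcement learning"]
for row in tfidf(titles):
    print(max(row, key=row.get), round(max(row.values()), 3))
```

A term that occurs in every document gets weight 0, which is why common words contribute little to the classifier's features.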

Complete code for approach 1

# -*- coding: utf-8 -*-
import json
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

filepath = r'D:\Datasets\archive\arxiv-metadata-oai-snapshot.json'
data = []  # Initialize
# Advantages of the with statement: 1. the file handle is closed automatically; 2. it is closed even if reading raises an exception
with open(filepath, 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)

        # Use only part of the data
        if idx > 200000:
            break

data = pd.DataFrame(data)  # Convert the list to a DataFrame for analysis with pandas

data['text'] = data['title'] + data['abstract']

data['text'] = data['text'].apply(lambda x: x.replace('\n',' '))
data['text'] = data['text'].apply(lambda x: x.lower())
data = data.drop(['abstract', 'title'], axis=1)

# Multiple categories, including subcategories
data['categories'] = data['categories'].apply(lambda x : x.split(' '))

# Top-level categories only, without subcategories
data['categories_big'] = data['categories'].apply(lambda x : [xx.split('.')[0] for xx in x])


mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])


vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
# Split into training and validation sets

x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                 test_size = 0.2,random_state = 1)

# Build the multi-label classification model

clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)

# Evaluate the model:
print(classification_report(y_test, clf.predict(x_test)))

Result

 precision    recall  f1-score   support

           0       0.95      0.85      0.90      7872
           1       0.85      0.78      0.81      7329
           2       0.77      0.72      0.74      2970
           3       0.00      0.00      0.00         2
           4       0.72      0.47      0.57      2149
           5       0.51      0.67      0.58       993
           6       0.89      0.35      0.50       538
           7       0.71      0.68      0.70      3657
           8       0.75      0.62      0.68      3382
           9       0.85      0.88      0.86     10809
          10       0.41      0.11      0.18      1796
          11       0.80      0.04      0.07       737
          12       0.44      0.33      0.38       540
          13       0.52      0.34      0.41      1070
          14       0.70      0.15      0.25      3435
          15       0.83      0.19      0.31       687
          16       0.88      0.18      0.30       249
          17       0.89      0.43      0.58      2565
          18       0.79      0.36      0.49       689

   micro avg       0.81      0.65      0.72     51469
   macro avg       0.70      0.43      0.49     51469
weighted avg       0.80      0.65      0.69     51469
 samples avg       0.72      0.72      0.70     51469
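
The rows 0 to 18 in the report are label columns, not category names. `MultiLabelBinarizer` keeps its labels in sorted order in `classes_`, so passing `target_names=mlb.classes_` to `classification_report` should print the names instead. The dependency-free sketch below mimics that column ordering on made-up labels; `multi_hot` is a hypothetical helper, not a scikit-learn function.

```python
def multi_hot(label_lists):
    """Encode lists of labels as 0/1 rows, one column per label in sorted order."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows

# Made-up top-level categories, in the shape of the categories_big column.
classes, rows = multi_hot([["hep-ph"], ["math", "cs"], ["cs", "cs"]])
for i, name in enumerate(classes):
    print(i, name)
print(rows)
```

Each printed index corresponds to one row label in a report like the one above, and a paper with several top-level categories has several 1s in its row.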

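Task 05: Author Relationship Linking

Task description

Learning topic: author relationships (a data modeling task); model the relationships between paper authors and find the most frequent ones;
Learning content: build an author relationship graph and mine author relationships
Learning outcome: an author knowledge graph for papers, graph relationship mining

Data processing steps
Process the author lists and compute the statistics. The steps are:

Build a graph linking each paper's first author with its other authors (the non-first authors);
Use graph algorithms to measure each author's connections to other authors in the graph

Social network analysis

The graph is an important concept in the study of complex networks. A graph is a mathematical model that uses points and lines to describe how pairs of items in a discrete set are related in some way. Graphs are everywhere in the real world: transport maps, tourist maps, flow charts, and so on. Many things in everyday life can be described with a graph. For example, points can represent intersections and the lines between them can represent roads, which makes it easy to describe a transport network.

Graph types

Undirected graph: ignores the direction of the edge between two nodes.
Directed graph: takes the direction of edges into account.
Undirected multigraph: two nodes may be joined by more than one edge, and a vertex may be connected to itself by an edge (a self-loop).

Graph statistics

Degree: the number of edges incident to a node, also called its connectivity. In a directed graph, a node's in-degree is the number of edges entering it and its out-degree is the number of edges leaving it;
Dijkstra path: the shortest path from a source node to each of the other nodes, which can be computed with Dijkstra's algorithm;
Connected graph: in an undirected graph G, if there is a path from vertex i to vertex j, then i and j are said to be connected. If G is a directed graph, all edges on the path connecting i and j must point in the same direction. If every pair of vertices in the graph is connected, the graph is called a connected graph; if the graph is directed, it is called strongly connected.

Other graph algorithms can be found in the networkx and igraph libraries.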
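
The three statistics above can be illustrated without any library. The sketch below is a plain-Python illustration on a made-up six-node graph, with hypothetical helper names; it uses breadth-first search for the shortest path, which on an unweighted graph finds a path of the same length as Dijkstra's algorithm would.

```python
from collections import deque

def build_graph(edges):
    """Undirected graph as a dict mapping each node to the set of its neighbours."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; on an unweighted graph this finds a shortest path."""
    if source not in graph or target not in graph:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None  # source and target are not connected

def connected_components(graph):
    """Group nodes into components; nodes in one component are mutually reachable."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

graph = build_graph([("A", "B"), ("B", "C"), ("A", "D"), ("E", "F")])
degrees = {node: len(neighbours) for node, neighbours in graph.items()}
print(degrees)                            # degree of every node
print(shortest_path(graph, "C", "D"))     # path between two connected nodes
print(len(connected_components(graph)))   # number of connected components
```

In networkx the corresponding calls are `G.degree()`, `nx.dijkstra_path(G, source, target)` and `nx.connected_components(G)`.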

# -*- coding: utf-8 -*-
import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

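## Read the data

filepath = r'D:\Datasets\archive\arxiv-metadata-oai-snapshot.json'
data = []  # Initialize
# Advantages of the with statement: 1. the file handle is closed automatically; 2. it is closed even if reading raises an exception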
with open(filepath, 'r') as f:
   for idx, line in enumerate(f):
       d = json.loads(line)
       d = {'authors_parsed': d['authors_parsed']}
       data.append(d)

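data = pd.DataFrame(data)  # Convert the list to a DataFrame for analysis with pandas

# Build an undirected graph of author links:


# Create an undirected graph
G = nx.Graph()

# Build the graph from only five papers
for row in data.iloc[:5].itertuples():
   authors = row[1]
   authors = [' '.join(x[:-1]) for x in authors]

   # Link the first author to every other author
   for author in authors[1:]:
       G.add_edge(authors[0], author)  # Adds both nodes if missing and connects them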

nx.draw(G, with_labels=True)

try:
   print(nx.dijkstra_path(G, 'Balázs C.', 'Ziambaras Eleni'))
except (nx.NetworkXNoPath, nx.NodeNotFound):
   print('No path')


# Count the connected components in the author graph
print(nx.number_connected_components(G))

degree_sequence = sorted([d for n, d in G.degree()], reverse=True)
dmax = max(degree_sequence)
plt.loglog(degree_sequence, "b-", marker="o")
plt.title("Degree rank plot")
plt.ylabel("degree")
plt.xlabel("rank")

# draw graph in inset
plt.axes([0.45, 0.45, 0.45, 0.45])
Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])

pos = nx.spring_layout(Gcc)
plt.axis("off")
nx.draw_networkx_nodes(Gcc, pos, node_size=20)
nx.draw_networkx_edges(Gcc, pos, alpha=0.4)
plt.show()
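
The task description also asks for the most frequent author relationships, which the graph code above does not count directly. A minimal sketch, assuming the same first-author-to-co-author pairing and the same name format as above; `first_author_pairs` is a hypothetical helper and the sample papers are made up.

```python
from collections import Counter

def first_author_pairs(authors_parsed):
    """Yield (first author, co-author) name pairs for each paper."""
    for authors in authors_parsed:
        names = [" ".join(a[:-1]) for a in authors]  # join last and first name, drop the suffix
        for co_author in names[1:]:
            yield names[0], co_author

# Made-up papers in the authors_parsed format.
papers = [
    [["Lee", "A.", ""], ["Kim", "B.", ""], ["Cho", "C.", ""]],
    [["Lee", "A.", ""], ["Kim", "B.", ""]],
    [["Park", "D.", ""]],
]
pair_counts = Counter(first_author_pairs(papers))
print(pair_counts.most_common(1))
```

The pairs are ordered, so (X, Y) and (Y, X) are counted separately; sorting each pair before counting would treat the relationship as symmetric.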