Academic Frontier Trend Analysis: Study Notes on Paper Statistics (Task 1)

小王做笔记

Column: Data Analysis and Mining / Academic Frontier Trend Analysis. Tags: data analysis, data mining, pandas, web scraping

Article link: https://blog.csdn.net/qsx123432/article/details/112403808

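Paper Statistics Study Notes

Task description
Analysis notes
Project walkthrough

Task description

Topic: counting papers, that is, counting the number of papers in each computer science field for the whole of 2019.

Content: understanding the competition task, reading the data with Pandas, and computing statistics.

Outcome: learn the basic operations of Pandas.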

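Analysis notes

The data we need is the number of papers in each computer science field for the whole of 2019. So where is the data, and how do we get it?

1. How to get it: scrape the dataset with a Python crawler

This requires:

import requests  # network access: sends HTTP requests and fetches the content at a URL
from bs4 import BeautifulSoup  # parses the HTML scraped from arXiv

Use the crawler to fetch the specified files from a web page
Reference article (on scraping specific content from a web page with Python)

Scraping without any restriction adds to the later data-processing work and muddies the analysis. So how do we restrict what is scraped, and which library do we need?

2. Use regular expressions to restrict it

import re  # regular expressions: matching string patterns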
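As a small illustration of restricting data with a regular expression, the sketch below keeps only strings that look like computer science category codes. The sample list and the pattern (cs. followed by two capital letters) are assumptions made for this example, not part of the original notes.

```python
import re

# Hypothetical sample of category codes; the pattern is an assumption about
# what a computer science code looks like (cs. followed by two capitals).
codes = ["cs.AI", "math.AT", "cs.CV", "astro-ph", "stat.ML"]
cs_pattern = re.compile(r"cs\.[A-Z]{2}")

# fullmatch() requires the whole string to match, not just a prefix
cs_codes = [c for c in codes if cs_pattern.fullmatch(c)]
print(cs_codes)
```

fullmatch() is used rather than search() so that a longer string that merely contains a matching fragment is not accepted.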

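3. Check whether the data already exists

Once we have the data, re-running the script should not scrape it again: that repeats work that is already done and wastes time and memory. So how do we avoid it, and how do we tell that the file already exists? (In other words: if the file exists, read it directly; if it does not, scrape it. Either way the code needs a check for whether the file exists.)

Use os.path.isfile() or pathlib.Path.is_file() to check whether a file exists.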
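A minimal sketch of the "read if present, otherwise create" pattern described above. The helper name load_or_create, the fake_scrape stand-in and the cache file name are all hypothetical; a real crawler call would take the place of fake_scrape.

```python
import json
import os
import tempfile
from pathlib import Path

def load_or_create(path, build):
    """Return cached JSON from path if it exists; otherwise build it, save it, and return it."""
    path = Path(path)
    if path.is_file():                      # same check as os.path.isfile(path)
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build()                          # stands in for the scraping step
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

calls = []
def fake_scrape():
    # Hypothetical replacement for a real scraper; records how often it runs
    calls.append(1)
    return {"cs.AI": 558}

cache_path = os.path.join(tempfile.mkdtemp(), "cache.json")
first = load_or_create(cache_path, fake_scrape)    # file missing: fake_scrape runs
second = load_or_create(cache_path, fake_scrape)   # file present: read from disk
print(first == second, len(calls))
```

The second call never reaches fake_scrape, which is the saving the paragraph above is after.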

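Here is a provisional example, which I may revise later.
Note that it comes from a different project and does not match this one: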

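import os
import numpy as np
from PIL import Image  # provides Image.open(), used in generateds()

train_path = './fashion_image_label/fashion_train_jpg_60000/'
train_txt = './fashion_image_label/fashion_train_jpg_60000.txt'
x_train_savepath = './fashion_image_label/fashion_x_train.npy'
y_train_savepath = './fashion_image_label/fashion_y_train.npy'
# test_path, test_txt, x_test_savepath and y_test_savepath are used below but are
# not defined in this excerpt; they must be defined in the same way before running.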


def generateds(path, txt):
    with open(txt, 'r') as f:
        contents = f.readlines()  # read the label file line by line
    x, y_ = [], []
    for content in contents:
        value = content.split()  # split on whitespace: value[0] is the image file name, value[1] the label
        img_path = path + value[0]
        img = Image.open(img_path)
        img = np.array(img.convert('L'))  # convert to 8-bit grayscale
        img = img / 255.  # scale pixel values to [0, 1]
        x.append(img)
        y_.append(value[1])
        print('loading : ' + content)

    x = np.array(x)
    y_ = np.array(y_)
    y_ = y_.astype(np.int64)
    return x, y_


if os.path.exists(x_train_savepath) and os.path.exists(y_train_savepath) and os.path.exists(
        x_test_savepath) and os.path.exists(y_test_savepath):
    print('-------------Load Datasets-----------------')
    x_train_save = np.load(x_train_savepath)
    y_train = np.load(y_train_savepath)
    x_test_save = np.load(x_test_savepath)
    y_test = np.load(y_test_savepath)
    x_train = np.reshape(x_train_save, (len(x_train_save), 28, 28))
    x_test = np.reshape(x_test_save, (len(x_test_save), 28, 28))
else:
    print('-------------Generate Datasets-----------------')
    x_train, y_train = generateds(train_path, train_txt)
    x_test, y_test = generateds(test_path, test_txt)

    print('-------------Save Datasets-----------------')
    x_train_save = np.reshape(x_train, (len(x_train), -1))
    x_test_save = np.reshape(x_test, (len(x_test), -1))
    np.save(x_train_savepath, x_train_save)
    np.save(y_train_savepath, y_train)
    np.save(x_test_savepath, x_test_save)
    np.save(y_test_savepath, y_test)

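Reference articles ("Three ways to check whether a file exists in Python")
("How to check whether a file exists")

4. Setting up and using JSON files

Once the data has been scraped, which format should it be saved in? Different file types are read in different ways.

Specifically, here we save the file in .json format. Some background on this format:

Reference articles ("The JSON file format explained")
("Reading and writing JSON files")

Examples:

4.1. Parsing a JSON string with json

import json  # for reading the data, which is in JSON format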

str='''[{"name":"kingsan",
        "age":'23'},
        {"name":"xiaolan",
        "age":"23"}]
'''
print(type(str))
data = json.loads(str)
print(data)
print(type(data))

Running this raises an error:

JSONDecodeError: Expecting value: line 2 column 15 (char 34)

Fix:
In JSON, string values must be wrapped in double quotes, not single quotes.

After the fix:

import json
str='''[{"name":"kingsan",
        "age":"23"},
        {"name":"xiaolan",
        "age":"23"}]
'''
print(type(str))
data = json.loads(str)
print(data)
print(type(data))

<class 'str'>
[{'name': 'kingsan', 'age': '23'}, {'name': 'xiaolan', 'age': '23'}]
<class 'list'>

Reference article ("Fixing json.decoder.JSONDecodeError: Expecting value")

4.2. Reading a text file with json

The general form is:

import json
with open('data.json','r') as file:
    str = file.read()
    data = json.loads(str)
    print(data)

A concrete example:

import json
data =[{
'name':'kingsan',
'age':'23'
}]

with open('data.json','w') as file:
    file.write(json.dumps(data))

with open('data.json', "r") as f:
    for idx, col in enumerate(f):
        print(idx)   # index of the line
        print(col)   # content of that line

4.3. Writing a dictionary to a JSON file

(Reference)

import json
data ={
      "grade1": {'name':'kingsan','age':'23'},
      "grade2": {"name": "xiaoliu","age":"24"},
      "grade3": {"name":"xiaowang","age":"22"}}


with open('data.json','w') as file:   # write the file
    file.write(json.dumps(data))

with open('data.json', "r") as f:     # read the file back
    for idx, line in enumerate(f):
        print(idx)
        print(line)

print(type(line))

0
{"grade1": {"name": "kingsan", "age": "23"}, "grade2": {"name": "xiaoliu", "age": "24"}, "grade3": {"name": "xiaowang", "age": "22"}}
<class 'str'>
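The examples above read the file back as raw text, which is why the type is str. To get the dictionary back, the text has to be parsed. A minimal sketch using json.dump() and json.load(), which write to and read from a file object directly; the temporary path is an assumption made so that the example does not overwrite a real data.json.

```python
import json
import os
import tempfile

data = {"grade1": {"name": "kingsan", "age": "23"},
        "grade2": {"name": "xiaoliu", "age": "24"}}

# Temporary location chosen for this sketch only
path = os.path.join(tempfile.mkdtemp(), "data.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f)          # serialize straight to the file

with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)       # parse the whole file back into a dict

print(type(loaded), loaded == data)
```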

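Our goal is to learn the basic operations of pandas and to use it for some data statistics and analysis.

Now for the actual project walkthrough:

Project walkthrough

1. Import the libraries

# Libraries for scraping
from bs4 import BeautifulSoup  # parses the HTML scraped from arXiv
import requests  # network access: sends HTTP requests and fetches the content at a URL

# Library for constraining the data format
import re  # regular expressions: matching string patterns

# Library for saving and reading data
import json  # our data is in JSON format

# Data analysis
import pandas as pd  # data processing and analysis

# Data visualization
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # plotting

2. Read the data

# Read the data
data  = []

# Advantage of the with statement: the file handle is closed automatically, even if an exception is raised while reading
with open("arxiv-metadata-oai-2019.json", 'r') as f: 
    for idx, line in enumerate(f): 
        
        # Read only the first 100 lines; reading all the data needs 8 GB of memory
        if idx >= 100:
            break
        
        data.append(json.loads(line))
        
data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data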

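Note: the file we use, arxiv-metadata-oai-2019.json, must be in the same directory as the code. If it is not, there are two options:
(1) copy or move it there manually
(2) point to it with a path variable

data=[]  # when re-reading, remember to re-initialize data first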

path = r"F:\Python_Tensorflow_codes\006group_learning\team-learning-data-mining-master\AcademicTrends\arxiv-metadata-oai-2019.json"
with open(path, 'r') as f: 
    for idx, line in enumerate(f): 
        
        # Read only the first 100 lines; reading all the data needs 8 GB of memory
        if idx >= 100:
            break
        
        data.append(json.loads(line))
        
data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data

The enumerate function

How enumerate is used here (link)
Reading and writing JSON files
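To see what enumerate() contributes to the reading loop, here is a small self-contained sketch. io.StringIO stands in for the real file, and the three records are made up; only the one-JSON-object-per-line layout mirrors the dataset.

```python
import io
import json

# Three made-up records in the same one-object-per-line layout
fake_file = io.StringIO(
    '{"id": "0001", "categories": "cs.AI"}\n'
    '{"id": "0002", "categories": "math.AT"}\n'
    '{"id": "0003", "categories": "cs.CV cs.LG"}\n'
)

rows = []
for idx, line in enumerate(fake_file):   # idx counts from 0, line is one raw text line
    if idx >= 2:                         # stop after the first two lines
        break
    rows.append(json.loads(line))        # parse each line separately

print(rows)
```

Because the loop breaks as soon as idx reaches the limit, the remaining lines are never parsed, which is what keeps memory use low on the full file.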

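3. After reading the data, take a quick look at the variable that holds it

data.head()  # show the first five rows

4. Look at the fields in the file and select those relevant to the task (counting 2019 papers)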

def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
        path: path to the file
        columns: columns to keep
        count: number of lines to read (None reads them all)
    '''
    
    data  = []
    with open(path, 'r') as f: 
        for idx, line in enumerate(f): 
            if idx == count:
                break
                
            d = json.loads(line)     # line is a str; json.loads() parses it into a dict
            d = {col : d[col] for col in columns}   # keep only the requested columns
            data.append(d)

    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-2019.json', ['id', 'categories', 'update_date'])  # keep only the columns relevant to the task

print(data.head())

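 d = json.loads(line)     # line is a str; json.loads() parses it into a dict
 d = {col : d[col] for col in columns}
 data.append(d)

These three steps are quite handy:

Step 1: convert line, the string produced by enumerate(f), into d, a dict.
This lets us arrange and extract the data by key and value.
Step 2: use a dict comprehension to pull out the requested fields one by one.
Step 3: append each extracted record to the data list.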
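The three steps can be tried on a single line in isolation. The record below is made up for illustration (the title in particular is a placeholder); the field names follow the columns used in these notes.

```python
import json

# One made-up line in the dataset's layout
line = '{"id": "0704.0297", "title": "Some title", "categories": "astro-ph", "update_date": "2019-08-19"}'
columns = ["id", "categories", "update_date"]

d = json.loads(line)                         # step 1: str -> dict
selected = {col: d[col] for col in columns}  # step 2: keep only the requested keys
records = []
records.append(selected)                     # step 3: collect the record

print(records)
```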

5. Data preprocessing

5.1 Category information

data["categories"].describe()

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

Here we want to find every distinct category that appears in the data:

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories

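5.1.1 How to use a list comprehension

A list comprehension is used here.

Why use a list comprehension?

Execution order:
[Figure: execution order of the list comprehension]

Our goal is to find the distinct categories, and we do it by iterating over the data.

5.1.2 Understanding the list comprehension: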

a = [x.split(' ') for x in data["categories"]]

for l in a:
    for i in l:
        unique_categories = set([i])

print(unique_categories)

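When I tried it this way, the result differed from that of the list comprehension:
set([i]) builds a new one-element set on every iteration and overwrites unique_categories, so only the last sub-element survives.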

{'nlin.SI'}
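The loop above rebinds unique_categories on every pass. A minimal sketch of a loop that accumulates instead, using a short made-up list in place of data["categories"]:

```python
# Made-up category strings standing in for data["categories"]
categories = ["astro-ph", "math.DG math.AG", "solv-int nlin.SI", "math.DG"]

unique_categories = set()             # create the set once, outside the loops
for row in categories:
    for code in row.split(" "):
        unique_categories.add(code)   # add to the existing set instead of replacing it

print(sorted(unique_categories))
```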

First, our goal is to extract the categories (and keep each one only once).
Look at the data:

data["categories"]

0                                         astro-ph
1                                          math.AT
2                                         astro-ph
3                                            gr-qc
4                                         astro-ph
                            ...                   
170613                                    quant-ph
170614                            solv-int nlin.SI
170615                            solv-int nlin.SI
170616    solv-int adap-org hep-th nlin.AO nlin.SI
170617                            solv-int nlin.SI
Name: categories, Length: 170618, dtype: object

As shown above, the first column is the index and the second is the category information.

print(type(data["categories"]))

<class 'pandas.core.series.Series'>

Use unique() to extract the distinct values. pd.unique() returns a numpy.ndarray (an array).

data_unique_cate = data["categories"].unique()   # equivalent: pd.unique(data["categories"])
print(type(data_unique_cate))
print(data_unique_cate)

# Extract the values from this array one by one and build a set from them
list(data_unique_cate)[:10]

['astro-ph',
 'math.AT',
 'gr-qc',
 'nucl-ex',
 'quant-ph',
 'math.DG',
 'hep-ex',
 'cond-mat.str-el cond-mat.mes-hall',
 'math.CA',
 'math.DG math.AG']

At this point we see that some elements of data_unique_cate (an array of strings) bundle several categories into one string. For example:

 'cond-mat.str-el cond-mat.mes-hall'

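In other words, one element is made up of two sub-elements, and these sub-elements may repeat across elements, because unique() only compares whole element values.

Looking again at: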

 'cond-mat.str-el cond-mat.mes-hall'

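The two sub-elements are separated by a space, so we can split them with split().

Note:
.split() is a string method, and it returns a list.
Reference (Python split() method)
First we need to take out the elements, which are strings whose sub-elements are separated by spaces.

So we do the following: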

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    print(x)

astro-ph
math.AT
gr-qc
nucl-ex
quant-ph
math.DG
hep-ex
cond-mat.str-el cond-mat.mes-hall
math.CA
math.DG math.AG

Splitting x directly inside the for i loop gives:

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
#     print(x)
    x_sp = x.split(" ")
    print(x_sp)

['astro-ph']
['math.AT']
['gr-qc']
['nucl-ex']
['quant-ph']
['math.DG']
['hep-ex']
['cond-mat.str-el', 'cond-mat.mes-hall']
['math.CA']
['math.DG', 'math.AG']

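This is not what we want: each result is now its own list of strings, so the sub-elements are still grouped per original element instead of forming one flat sequence.

Another attempt: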

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
#     print(x)
    x_sp = x[i].split(" ")
    print(x_sp)

['a']
['a']
['-']
['l']
['t']
['D']
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-58-a96f3620fe05> in <module>
     30     x = data_unique_cate[i]
     31 #     print(x)
---> 32     x_sp = x[i].split(" ")
     33     print(x_sp)
     34 #     for l in x.split(" "):

IndexError: string index out of range

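This is even worse: x[i] picks a single character out of the string, so only individual characters are split, and the index runs past the end of the shorter strings.

Summary: first make sure each element is handled on its own, then take the whole element and split it on spaces.

In practice: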

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
#     print(x)
#     x_sp = x[i].split(" ")  # wrong approach
#     print(x_sp)     # wrong approach
    for l in x.split(" "):   # x is one whole element
        element = l         # element is one sub-element of x after splitting on spaces
        print(element)

astro-ph
math.AT
gr-qc
nucl-ex
quant-ph
math.DG
hep-ex
cond-mat.str-el
cond-mat.mes-hall
math.CA
math.DG
math.AG

This takes the elements out one by one and splits each into its sub-elements.

Next, collect the resulting element values into a new list and convert it to a set:

list_name = []
for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
#     print(x)
#     x_sp = x[i].split(" ")  # wrong approach
#     print(x_sp)     # wrong approach
    for l in x.split(" "):   # x is one whole element
        element = l         # element is one sub-element of x after splitting on spaces
#         print(element)
        list_name.append(element)
set(list_name)

{'astro-ph',
 'cond-mat.mes-hall',
 'cond-mat.str-el',
 'gr-qc',
 'hep-ex',
 'math.AG',
 'math.AT',
 'math.CA',
 'math.DG',
 'nucl-ex',
 'quant-ph'}

The above operates on the first 10 rows only.

Now do the same for all the data:

list_name = []
for i in range(len(data_unique_cate)):
    x = data_unique_cate[i]
#     print(x)
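#     x_sp = x[i].split(" ")  # wrong approach
#     print(x_sp)     # wrong approach
    for l in x.split(" "):   # x is one whole element
        element = l         # element is one sub-element of x after splitting on spaces
#         print(element)
        list_name.append(element)
    
# print(list_name)   # printing the whole list is slow and floods the output
set(list_name)   # the output looks sorted, but that is an illusion: a Python set has no guaranteed order

Result: too long, omitted

Note:
set() converts the data to a set. The output appears to be "sorted", but it is not; sets are unordered: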

s1 = {7,2,2,1,6, 9,3}
s2 = {4, 2, 1, 7, 2,52,1,4,7,1,14,7,21,33,4, 37}
print('s1', s1)
print('s2', s2)

s1 {1, 2, 3, 6, 7, 9}
s2 {1, 2, 33, 4, 37, 7, 14, 52, 21}
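The s2 output above already shows that the display order of a set is not a sort order. When a stable order is needed, one option is to sort explicitly. A minimal sketch reusing s1 from above:

```python
s1 = {7, 2, 2, 1, 6, 9, 3}

print(len(s1))        # duplicates are removed, so 7 literals give 6 elements
print(sorted(s1))     # sorted() returns a list in a guaranteed ascending order
```

sorted() returns a list, so the result is no longer a set; convert back with set() only if set operations are needed afterwards.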

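With that in mind, look again at the list comprehension, which does the same thing more concisely:

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])

How it is built:

for x in data["categories"] iterates over all the category strings (x is one category string),
split(' ') then splits each one, and the results are collected in [ ] to form a list of lists;
next, for l in iterates over that list of lists;
finally, for i in l takes out each sub-element;
everything is collected in [ ] and then converted to a set.

Note: the set() in the original program performs the de-duplication, which is why it can iterate over data["categories"] directly.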
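For comparison, here are two other ways to reach the same set, sketched on a small made-up DataFrame. Neither appears in the original notes: the first is a set comprehension, the second uses the pandas string accessor with explode() (available in pandas 0.25 and later).

```python
import pandas as pd

# Made-up stand-in for the real data
data = pd.DataFrame({"categories": ["astro-ph", "math.DG math.AG", "solv-int nlin.SI", "math.DG"]})

# Set comprehension: the same nested loops, without the intermediate lists
via_set = {code for row in data["categories"] for code in row.split(" ")}

# pandas route: split each string into a list, give each code its own row, then de-duplicate
via_pandas = set(data["categories"].str.split(" ").explode().unique())

print(via_set == via_pandas)
```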

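5.2 Preprocessing the time feature

The task asks us to analyze papers from 2019 onwards, so we first preprocess the time feature to obtain the papers of every category from 2019 onwards:

(This shows that the work always revolves around what the task asks for.)

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from a str such as 2019-02-20 to datetime and extract the year
del data["update_date"]  # drop update_date; it has served its purpose
data = data[data["year"] >= 2019]  # keep only the rows with year >= 2019

# data.groupby(['categories','year'])  # groups by categories, then by year; it does not sort or modify data
data.reset_index(drop=True, inplace=True)  # renumber the index
data  # show the result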

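Let us use this code to review:

5.2.1 Handling time series with pandas (short version)

For more detail, see the reference article:
Python: time series data processing with Pandas

Clarify the task: we want to count the papers from 2019. We already know the categories, so now we need to select the rows dated 2019.

1. First inspect the type of the time data and extract it: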

print(type(data["update_date"]))
print("-"*10)
print(data["update_date"])
print("*"*10)
print(type(data["update_date"][0]))

<class 'pandas.core.series.Series'>
----------
0         2019-08-19
1         2019-08-19
2         2019-08-19
3         2019-10-21
4         2019-08-19
             ...    
170613    2019-08-17
170614    2019-08-15
170615    2019-08-17
170616    2019-08-17
170617    2019-08-21
Name: update_date, Length: 170618, dtype: object
**********
<class 'str'>

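Observations:

1. The column that holds the time is update_date
2. The time data is stored in data["update_date"], which is a pd.Series
3. Its elements are currently of type str; that is, they are not yet time data but strings that merely look like dates
4. Each string contains the year, month and day

Given these observations, we proceed as follows:

First, convert the date strings into datetime values.
Reference articles:

pandas: converting a string column to a time type, object to datetime64[ns]

Pandas: converting DataFrame string dates to datetime dates (these two articles cover time conversion, adding one day to a date, and extracting year, month and day)

   Here we use pd.to_datetime()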

data_time = pd.to_datetime(data["update_date"])   #  data["year"] = pd.to_datetime(data["update_date"].values)
data_time

0        2019-08-19
1        2019-08-19
2        2019-08-19
3        2019-10-21
4        2019-08-19
            ...    
170613   2019-08-17
170614   2019-08-15
170615   2019-08-17
170616   2019-08-17
170617   2019-08-21
Name: update_date, Length: 170618, dtype: datetime64[ns]

The values are now datetimes; next we extract the year from them.

Use pandas.Series.dt.year (dt stands for datetime, which makes it easy to remember)

data["year"] = data_time.dt.year
print(data["year"])


0         2019
1         2019
2         2019
3         2019
4         2019
          ... 
170613    2019
170614    2019
170615    2019
170616    2019
170617    2019
Name: year, Length: 170618, dtype: int64

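To avoid keeping two time columns, delete the original time feature:

print(data.head())  # data before deletion, for comparison
del data["update_date"]  # modifies the DataFrame in place
print(data.head())  # data after deletion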

          id categories update_date  year
0  0704.0297   astro-ph  2019-08-19  2019
1  0704.0342    math.AT  2019-08-19  2019
2  0704.0360   astro-ph  2019-08-19  2019
3  0704.0525      gr-qc  2019-10-21  2019
4  0704.0535   astro-ph  2019-08-19  2019
          id categories  year
0  0704.0297   astro-ph  2019
1  0704.0342    math.AT  2019
2  0704.0360   astro-ph  2019
3  0704.0525      gr-qc  2019
4  0704.0535   astro-ph  2019

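Note that del data["update_date"] modifies the original data in place. Running the statement a second time raises an error, because the column has already been deleted and can no longer be found.
The error is:

KeyError: 'update_date'

Next we filter for the 2019 data:

data = data[data["year"] == 2019]  # keep only the rows whose year is 2019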
print(data.head())

          id categories  year
0  0704.0297   astro-ph  2019
1  0704.0342    math.AT  2019
2  0704.0360   astro-ph  2019
3  0704.0525      gr-qc  2019
4  0704.0535   astro-ph  2019

This selects the rows dated 2019 and assigns the result back to data, which is a DataFrame.
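The whole time-handling sequence can be tried on a tiny made-up table. One variation from the steps above, offered only as an option: drop(columns=..., errors="ignore") is used instead of del, so that re-running the cell does not raise the KeyError described earlier.

```python
import pandas as pd

# Made-up rows; only the column names follow the notes
data = pd.DataFrame({
    "id": ["a", "b", "c"],
    "update_date": ["2019-08-19", "2018-01-05", "2019-10-21"],
})

data["year"] = pd.to_datetime(data["update_date"]).dt.year   # str -> datetime -> int year

# errors="ignore" makes the drop a no-op if the column is already gone
data = data.drop(columns=["update_date"], errors="ignore")
data = data.drop(columns=["update_date"], errors="ignore")   # second run: no error

data_2019 = data[data["year"] == 2019].reset_index(drop=True)
print(data_2019)
```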

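At the risk of repeating myself, the goal again: count the 2019 papers.

data.groupby(['categories','year'])  # groups by categories, then by year; the result is discarded, so data is unchanged
data.reset_index(drop=True, inplace=True)  # renumber the index
data  # show the result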


id	categories	year
0	0704.0297	astro-ph	2019
1	0704.0342	math.AT	2019
2	0704.0360	astro-ph	2019
3	0704.0525	gr-qc	2019
4	0704.0535	astro-ph	2019
...	...	...	...
170613	quant-ph/9904032	quant-ph	2019
170614	solv-int/9511005	solv-int nlin.SI	2019
170615	solv-int/9809008	solv-int nlin.SI	2019
170616	solv-int/9909010	solv-int adap-org hep-th nlin.AO nlin.SI	2019
170617	solv-int/9909014	solv-int nlin.SI	2019
170618 rows × 3 columns

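2. Finding the categories that belong to a field when the category list is not known in advance

We now have the category data for all 2019 papers. Next we select all articles in the computer science field.
That is: we need to find the category codes related to computer science and then match against them.

So how do we find the categories?
2.1 Scrape all categories with a crawler:

# scrape all categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the page as text
soup = BeautifulSoup(website_url,'lxml')  # parse the HTML; the lxml parser is used for speed
root = soup.find('div',{'id':'category_taxonomy_list'})  # locate the container element of the taxonomy list
tags = root.find_all(["h2","h3","h4","p"], recursive=True)  # collect the heading and paragraph tags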

# initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_3_code = ""
level_3_name = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# walk through the tags and fill the lists
for t in tags:
    if t.name == "h2":
        level_1_name = t.text    
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw)  # regex: pattern (.*)\((.*)\); replacement \2, the text inside the parentheses; input string raw
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# build a DataFrame from the lists above
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
    
})

# groupby() returns a grouped object and neither sorts nor modifies df_taxonomy, so this line has no visible effect
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy

	group_name	archive_name	archive_id	category_name	categories	category_description
0	Computer Science	Computer Science	Computer Science	Artificial Intelligence	cs.AI	Covers all areas of AI except Vision, Robotics...
1	Computer Science	Computer Science	Computer Science	Hardware Architecture	cs.AR	Covers systems organization and hardware archi...
2	Computer Science	Computer Science	Computer Science	Computational Complexity	cs.CC	Covers models of computation, complexity class...
3	Computer Science	Computer Science	Computer Science	Computational Engineering, Finance, and Science	cs.CE	Covers applications of computer science to the...
4	Computer Science	Computer Science	Computer Science	Computational Geometry	cs.CG	Roughly includes material in ACM Subject Class...
...	...	...	...	...	...	...
150	Statistics	Statistics	Statistics	Computation	stat.CO	Algorithms, Simulation, Visualization
151	Statistics	Statistics	Statistics	Methodology	stat.ME	Design, Surveys, Model Selection, Multiple Tes...
152	Statistics	Statistics	Statistics	Machine Learning	stat.ML	Covers machine learning papers (supervised, un...
153	Statistics	Statistics	Statistics	Other Statistics	stat.OT	Work in statistics that does not fit into the ...
154	Statistics	Statistics	Statistics	Statistics Theory	stat.TH	stat.TH is an alias for math.ST. Asymptotics, ...
155 rows × 6 columns

I do not fully understand the scraping part yet, so I will leave it for now and fill it in after my exams.
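One piece of the scraper that can be studied without any network access is the re.sub() calls. The sketch below applies the same two patterns to sample strings; the strings are made up to resemble "name (code)" and "code (name)" headings and are not copied from the page.

```python
import re

# Sample strings shaped like the scraped headings (made up for illustration)
h3_text = "Astrophysics (astro-ph)"
h4_text = "cs.AI (Artificial Intelligence)"

# h3 pattern: group 1 is the text before "(", group 2 the text inside the parentheses
archive_code = re.sub(r"(.*)\((.*)\)", r"\2", h3_text)
archive_name = re.sub(r"(.*)\((.*)\)", r"\1", h3_text).strip()   # strip() removes the space left before "("

# h4 pattern: the literal space before "(" is part of the pattern, so group 1 has no trailing space
category_code = re.sub(r"(.*) \((.*)\)", r"\1", h4_text)
category_name = re.sub(r"(.*) \((.*)\)", r"\2", h4_text)

print(archive_code, "|", archive_name, "|", category_code, "|", category_name)
```

Since the pattern matches the whole string, re.sub() replaces everything with the chosen group, which is how a substitution call ends up acting as an extraction.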

3. Data visualization

We now have the scraped taxonomy, so we filter by category, match it against the categories in the .json file, and count.

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df

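I can follow this when reading it, but I would not have written it this way myself. Below is my own understanding of this code.

First, the goal: count papers per category.

The input is the data scraped earlier, so let us look back at it (155 rows × 6 columns).
[Figure: the scraped taxonomy table]
What we want to do now:
count the rows that share the same group name;
then count the rows that share both the group name and the subfield;
then count the 2019 papers for the computer science group and its subfields, and visualize the result (a pie chart of the shares).

Step by step:

3.1 Counting the top-level groups with the same name

df_taxonomy.groupby("group_name").count()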

	archive_name	archive_id	category_name	categories	category_description
group_name					
Computer Science	40	40	40	40	40
Economics	3	3	3	3	3
Electrical Engineering and Systems Science	4	4	4	4	4
Mathematics	32	32	32	32	32
Physics	51	51	51	51	51
Quantitative Biology	10	10	10	10	10
Quantitative Finance	9	9	9	9	9
Statistics	6	6	6	6	6

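This only counts the top-level groups in the scraped taxonomy:
155 rows in total, in 8 groups.
Clearly it is not yet combined with the paper data.

So how do we combine the two datasets (join them on the column they share)?

Use merge: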

data.merge(df_taxonomy, on="categories", how="left")

This merely joins the two tables. Next, drop the duplicated (id, group_name) pairs so that each paper is counted at most once per group:

data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id", "group_name"])

Then group by group_name and count the ids (duplicates are already removed, so each paper counts once per group).
This uses the aggregation function agg({"column": "method"})

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()

_df


group_name	id
0	Physics	79985
1	Mathematics	51567
2	Computer Science	40067
3	Statistics	4054
4	Electrical Engineering and Systems Science	3297
5	Quantitative Biology	1994
6	Quantitative Finance	826
7	Economics	576

The top-level groups are now counted.
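To watch each stage of the chained expression at a size that fits on screen, the sketch below runs the same chain on two tiny made-up tables. The names papers and taxonomy are stand-ins for data and df_taxonomy. Note that merge(on="categories") joins on exact equality of the key string.

```python
import pandas as pd

# Made-up stand-ins for data and df_taxonomy
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "categories": ["cs.AI", "cs.CV", "math.AT", "cs.AI"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CV", "math.AT"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics"],
})

counts = (
    papers.merge(taxonomy, on="categories", how="left")   # attach group_name to every paper
          .drop_duplicates(["id", "group_name"])          # one row per (paper, group) pair
          .groupby("group_name")
          .agg({"id": "count"})                           # count ids in each group
          .sort_values(by="id", ascending=False)
          .reset_index()
)
print(counts)
```

Wrapping the chain in parentheses and placing one method per line keeps the logic identical while making each step easy to comment out and inspect.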

3.2 Visualizing the top-level groups

fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1) 
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

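This part is just a matter of writing the code; for how the functions are used, see (https://www.jb51.net/article/100389.htm)
Meaning of the parameters: drawing pie charts with matplotlib in Python

A short summary of the plotting steps and parameters:

1. Import the library: import matplotlib.pyplot as plt
2. Decide whether a Chinese font is needed: plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font so that Chinese text is displayed
3. Call plt.pie(values, labels=labels, colors=colors, explode=[offset of wedge 1, offset of wedge 2, ...], autopct="format of the percentage labels"); the arguments must be prepared beforehand.
4. plt.legend(names, loc="position") adds a legend that shows the labels
5. plt.show()

If Chinese text is not displayed, see (Fixing garbled Chinese in Matplotlib plots under Python 3)

The approach used for the pie chart carries over to other kinds of plots.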
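The steps above can be sketched as follows. This version assumes a setup without a display, so it selects the Agg backend and saves the figure to a temporary file instead of calling plt.show(); the four values are taken from the group counts shown earlier, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")            # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

labels = ["Physics", "Mathematics", "Computer Science", "Statistics"]
values = [79985, 51567, 40067, 4054]
explode = (0, 0, 0, 0.2)         # one offset per wedge; only the smallest wedge is pulled out

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    values, labels=labels, autopct="%1.2f%%", startangle=160, explode=explode
)
ax.legend(wedges, labels, loc="upper right")   # legend built from the wedge handles
fig.tight_layout()

out_path = os.path.join(tempfile.mkdtemp(), "pie_sketch.png")   # hypothetical file name
fig.savefig(out_path)
```

explode must have exactly one entry per value, which is why the tuple in the notes has eight entries for eight groups.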

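3.3 Counting the subcategories within a group (the number of 2019 papers in each computer science subfield)

First we need to select the computer science articles. That raises some questions:
Q1: Where do we select them from? The data must contain both the scraped taxonomy and the paper data, so the two must be combined with pandas.
A1: Select from the merged data.

Q2: How do we pick out the computer science rows? By querying one category with pandas.
A2: First look at the table to see which column holds "Computer Science", then query that value with the query() method.

Q3: How do we count?
A3: Group by group and subcategory, then count.

In practice: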

group_name = "Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["group_name", "category_name"]).count().reset_index()

	group_name	category_name	id	categories	year	archive_name	archive_id	category_description
0	Computer Science	Artificial Intelligence	558	558	558	558	558	558
1	Computer Science	Computation and Language	2153	2153	2153	2153	2153	2153
2	Computer Science	Computational Complexity	131	131	131	131	131	131
3	Computer Science	Computational Engineering, Finance, and Science	108	108	108	108	108	108
4	Computer Science	Computational Geometry	199	199	199	199	199	199
5	Computer Science	Computer Science and Game Theory	281	281	281	281	281	281
6	Computer Science	Computer Vision and Pattern Recognition	5559	5559	5559	5559	5559	5559
7	Computer Science	Computers and Society	346	346	346	346	346	346
8	Computer Science	Cryptography and Security	1067	1067	1067	1067	1067	1067
9	Computer Science	Data Structures and Algorithms	711	711	711	711	711	711
10	Computer Science	Databases	282	282	282	282	282	282
11	Computer Science	Digital Libraries	125	125	125	125	125	125
12	Computer Science	Discrete Mathematics	84	84	84	84	84	84
13	Computer Science	Distributed, Parallel, and Cluster Computing	715	715	715	715	715	715
14	Computer Science	Emerging Technologies	101	101	101	101	101	101
15	Computer Science	Formal Languages and Automata Theory	152	152	152	152	152	152
16	Computer Science	General Literature	5	5	5	5	5	5
17	Computer Science	Graphics	116	116	116	116	116	116
18	Computer Science	Hardware Architecture	95	95	95	95	95	95
19	Computer Science	Human-Computer Interaction	420	420	420	420	420	420
20	Computer Science	Information Retrieval	245	245	245	245	245	245
21	Computer Science	Logic in Computer Science	470	470	470	470	470	470
22	Computer Science	Machine Learning	177	177	177	177	177	177
23	Computer Science	Mathematical Software	27	27	27	27	27	27
24	Computer Science	Multiagent Systems	85	85	85	85	85	85
25	Computer Science	Multimedia	76	76	76	76	76	76
26	Computer Science	Networking and Internet Architecture	864	864	864	864	864	864
27	Computer Science	Neural and Evolutionary Computing	235	235	235	235	235	235
28	Computer Science	Numerical Analysis	40	40	40	40	40	40
29	Computer Science	Operating Systems	36	36	36	36	36	36
30	Computer Science	Other Computer Science	67	67	67	67	67	67
31	Computer Science	Performance	45	45	45	45	45	45
32	Computer Science	Programming Languages	268	268	268	268	268	268
33	Computer Science	Robotics	917	917	917	917	917	917
34	Computer Science	Social and Information Networks	202	202	202	202	202	202
35	Computer Science	Software Engineering	659	659	659	659	659	659
36	Computer Science	Sound	7	7	7	7	7	7
37	Computer Science	Symbolic Computation	44	44	44	44	44	44
38	Computer Science	Systems and Control	415	415	415	415

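This result is not easy to read: the repeated values are not consolidated, and every column shows the same count.

Since we already know we are working on computer science only, the table needs just category_name, id and the year 2019.

The table has to be reshaped.

I could not work it out myself.

The reference answer uses the pivot() method:
See: "Notes on using pivot in the Python pandas library (solving Excel-style reshaping)"

So the question is:
what problem does pivot() solve?
"Understanding pandas pivot_table in one article"

"Pandas pivot tables (pivot_table) explained"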
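As a tentative answer to that question: pivot() reshapes a long table (one row per combination of keys) into a wide one (one key as rows, another as columns). The sketch below shows this on made-up data; whether the reference answer uses exactly these arguments is an assumption.

```python
import pandas as pd

# Made-up merged table; only the column names follow the notes
cats = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5"],
    "category_name": ["Robotics", "Robotics", "Databases", "Databases", "Databases"],
    "year": [2019, 2019, 2019, 2019, 2019],
})

# Long format: one row per (category_name, year) with the paper count
long_df = cats.groupby(["category_name", "year"])["id"].count().reset_index()

# Wide format: categories as rows, years as columns, counts as cell values
wide_df = long_df.pivot(index="category_name", columns="year", values="id")
print(wide_df)
```

Selecting only the id column before count() avoids the repeated identical counts in every column, and the pivot then leaves one count column per year.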
