Paper Data Analysis 4: Paper Category Classification (CSDN blog)

Article link: https://blog.csdn.net/qq_36559719/article/details/113010172

Task 4: Paper Category Classification

This part is not finished yet. I am posting it now for reference and will keep adding to it, so please go easy on it.

4.1 Task Description

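Topic: paper classification (a data modeling task). Build a model on existing data and use it to assign categories to new papers.
Content: classify papers by category using their titles.
Outcome: learn the basic methods of text classification, TF-IDF, and so on.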

4.3 Approaches to Text Classification

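Approach 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF, then classify with a conventional classifier such as SVM, LR, or XGBoost.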

Approach 2: FastText
FastText is an entry-level word-embedding method. With the FastText tool released by Facebook, a classifier can be built quickly.

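Approach 3: Word2Vec + deep learning classifier
Word2Vec is a more advanced word-embedding method. Classification is then done by a deep learning model, whose network structure can be TextCNN, TextRNN, or BiLSTM.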

Approach 4: BERT embeddings
BERT is the high-end option among word embeddings and has strong modeling capacity.
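
For Approach 2, the part that usually needs care is the input file. As far as I know, fastText's supervised mode reads plain text with one example per line, where every label is written with the `__label__` prefix and a multi-label example simply carries several prefixed labels. The helper name `to_fasttext_line` and the two example texts below are made up for illustration; this is only a sketch of the data preparation, not of the training itself.

```python
def to_fasttext_line(labels, text):
    """Format one example as '__label__a __label__b some text'."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    # fastText expects one example per line, so embedded newlines
    # and repeated whitespace are collapsed into single spaces.
    clean_text = " ".join(text.split())
    return f"{prefix} {clean_text}"


# Hypothetical examples: (list of top-level categories, text).
examples = [
    (["astro-ph"], "white dwarf merger simulation"),
    (["math", "cs"], "cofibrations in a category\nof spaces"),
]

lines = [to_fasttext_line(labels, text) for labels, text in examples]
for line in lines:
    print(line)
```

Writing these lines to a text file gives something that can be passed to the `fasttext` package as training input.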

import pandas as pd 
import numpy as np
import re 
import json
import matplotlib.pyplot as plt

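data = []  # Collects one dict per paper
# The with statement closes the file handle automatically, even if an exception is raised while reading
with open(r'arxiv-metadata-oai-2019.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)

        # Use only part of the data: stop once the line index exceeds 200000
        if idx > 200000:
            break

data = pd.DataFrame(data)  # Convert the list to a DataFrame so it can be analyzed with pandas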
data.head()

	title	categories	abstract
0	Remnant evolution after a carbon-oxygen white ...	astro-ph	We systematically explore the evolution of t...
1	Cofibrations in the Category of Frolicher Spac...	math.AT	Cofibrations are defined in the category of ...
2	Torsional oscillations of longitudinally inhom...	astro-ph	We explore the effect of an inhomogeneous ma...
3	On the Energy-Momentum Problem in Static Einst...	gr-qc	This paper has been removed by arXiv adminis...
4	The Formation of Globular Cluster Systems in M...	astro-ph	The most massive elliptical galaxies show a ...
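
The loading loop above appends first and checks `idx > 200000` afterwards, so it keeps 200,002 records rather than exactly 200,000. If an exact cap matters, `itertools.islice` is one way to express it. The sketch below runs on three made-up records held in an in-memory `io.StringIO` object instead of the real file; the function name `read_json_lines` is my own.

```python
import io
import json
from itertools import islice


def read_json_lines(handle, fields, limit):
    """Parse at most `limit` JSON Lines records, keeping only `fields`."""
    records = []
    for line in islice(handle, limit):
        record = json.loads(line)
        records.append({name: record[name] for name in fields})
    return records


# Stand-in for the real file: three made-up records, one JSON object per line.
fake_file = io.StringIO(
    '{"id": "1", "title": "A", "categories": "astro-ph", "abstract": "x"}\n'
    '{"id": "2", "title": "B", "categories": "math.AT", "abstract": "y"}\n'
    '{"id": "3", "title": "C", "categories": "gr-qc", "abstract": "z"}\n'
)

rows = read_json_lines(fake_file, ["title", "categories", "abstract"], limit=2)
print(rows)
```

With the real data, the same function could be called on the handle returned by `open(...)`, and the resulting list passed to `pd.DataFrame`.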

To make the data easier to work with, we can concatenate the title and the abstract and classify on the combined text.

data['text'] = data['title'] + data['abstract']

data['text'] = data['text'].apply(lambda x: x.replace('\n',' '))
data['text'].head()

0    Remnant evolution after a carbon-oxygen white ...
1    Cofibrations in the Category of Frolicher Spac...
2    Torsional oscillations of longitudinally inhom...
3    On the Energy-Momentum Problem in Static Einst...
4    The Formation of Globular Cluster Systems in M...
Name: text, dtype: object

data['text'] = data['text'].apply(lambda x: x.lower())
data['text'].head()

0    remnant evolution after a carbon-oxygen white ...
1    cofibrations in the category of frolicher spac...
2    torsional oscillations of longitudinally inhom...
3    on the energy-momentum problem in static einst...
4    the formation of globular cluster systems in m...
Name: text, dtype: object
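
The cleaning steps above (concatenate, replace newlines, lowercase) can also be folded into one helper. This is a variant rather than the exact steps used in this post: it puts an explicit space between title and abstract and collapses every run of whitespace, not only `\n`. The name `normalize_text` and the sample strings are made up.

```python
import re


def normalize_text(title, abstract):
    """Join title and abstract, collapse whitespace, and lowercase."""
    combined = f"{title} {abstract}"
    # \s+ matches newlines, tabs and repeated spaces in one pass.
    return re.sub(r"\s+", " ", combined).strip().lower()


print(normalize_text("A Toy Title", "  First line\nof the abstract.\n"))
```

On a DataFrame this could be applied row by row, for example with `data.apply(lambda row: normalize_text(row['title'], row['abstract']), axis=1)`.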

data = data.drop(['abstract', 'title'], axis=1)

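A paper may belong to more than one category. In the raw data the categories are stored as a single space-separated string.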

# All categories of a paper, including subcategories
data['categories'] = data['categories'].apply(lambda x : x.split(' '))
data['categories'].head()

0    [astro-ph]
1     [math.AT]
2    [astro-ph]
3       [gr-qc]
4    [astro-ph]
Name: categories, dtype: object

# Top-level categories only, with the subcategory suffix dropped
data['categories_single'] = data['categories'].apply(lambda x : [xx.split('.')[0] for xx in x])
data['categories_single'].head()

0    [astro-ph]
1        [math]
2    [astro-ph]
3       [gr-qc]
4    [astro-ph]
Name: categories_single, dtype: object
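
One detail of the two steps above: a paper tagged with two subcategories of the same field, say `math.AT math.CO`, ends up with `math` twice in `categories_single`. As far as I can tell this does not change the multi-hot encoding used later, but for counting or display it can be handy to remove the duplicates. A small sketch with made-up input; the function names are my own.

```python
def split_categories(raw):
    """Turn 'math.AT cs.LG' into ['math.AT', 'cs.LG']."""
    return raw.split()


def top_level(categories):
    """Drop the subcategory suffix and remove duplicates, keeping order."""
    heads = [category.split(".")[0] for category in categories]
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(heads))


full = split_categories("math.AT math.CO cs.LG")
print(full)
print(top_level(full))
```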

Next, encode the categories. Because each paper can carry several categories, we need multi-label (multi-hot) encoding:

import sklearn
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_single'].iloc[:])  # Multi-hot label matrix
data_label[:5]

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
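
Each row above has 34 columns, one per top-level category, but the array alone does not say which column is which. `MultiLabelBinarizer` keeps the sorted label names in its `classes_` attribute, and `inverse_transform` turns rows back into label tuples. A small sketch on three made-up label lists (the `toy_` names are mine):

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [["astro-ph"], ["math"], ["astro-ph", "gr-qc"]]
toy_mlb = MultiLabelBinarizer()
encoded = toy_mlb.fit_transform(toy_labels)

# Column i of the encoded matrix corresponds to classes_[i].
print(list(toy_mlb.classes_))
print(encoded.tolist())

# Decode the multi-hot rows back into tuples of label names.
print(toy_mlb.inverse_transform(encoded))
```

In this post, `mlb.classes_` should give the category names behind column indices 0 to 33.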

Approach 1

Approach 1 uses TF-IDF to extract features, keeping at most 4,000 words in the vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
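
To see what the vectorizer computes, here is a hand-rolled version of the weighting. It assumes scikit-learn's default settings as I understand them: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The tokenization is a plain whitespace split, which is simpler than the library's tokenizer, and the three documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


vocab, vecs = tfidf_vectors(["galaxy cluster survey", "galaxy rotation", "graph theory"])
for vec in vecs:
    print([round(w, 3) for w in vec])
```

In the first document, `galaxy` appears in two of the three documents and therefore gets a lower weight than `cluster`, which appears in only one.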

Because this is a multi-label problem, we can wrap a base classifier in sklearn's multi-output wrapper:

# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                 test_size = 0.2,random_state = 1)

# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)
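
`MultiOutputClassifier` fits one independent copy of the base estimator per label column, so the base estimator is easy to swap. The sketch below uses `LogisticRegression`, one of the alternatives listed under Approach 1, on a made-up six-document corpus. The texts, labels and `toy_` names are illustrative only and say nothing about accuracy on the arXiv data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up corpus: every label has both positive and negative examples,
# which LogisticRegression needs in order to fit each column.
toy_texts = [
    "galaxy cluster survey", "stellar rotation curve",
    "graph theory proof", "prime number theorem",
    "galaxy graph statistics", "stellar number counts",
]
toy_labels = [["astro-ph"], ["astro-ph"], ["math"], ["math"],
              ["astro-ph", "math"], ["astro-ph", "math"]]

toy_mlb = MultiLabelBinarizer()
y = toy_mlb.fit_transform(toy_labels)

# The pipeline chains the vectorizer and the per-label classifiers.
model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(toy_texts, y)

pred = model.predict(["galaxy survey"])
n_fitted = len(model.named_steps["multioutputclassifier"].estimators_)
print(pred.shape, n_fitted)
```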

from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         0
           3       0.91      0.85      0.88      3625
           4       0.00      0.00      0.00         4
           5       0.00      0.00      0.00         0
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         0
           8       0.77      0.76      0.77      3801
           9       0.84      0.89      0.86     10715
          10       0.00      0.00      0.00         0
          11       0.00      0.00      0.00       186
          12       0.44      0.41      0.42      1621
          13       0.00      0.00      0.00         1
          14       0.75      0.59      0.66      1096
          15       0.61      0.80      0.69      1078
          16       0.90      0.19      0.32       242
          17       0.53      0.67      0.59      1451
          18       0.71      0.54      0.62      1400
          19       0.88      0.84      0.86     10243
          20       0.40      0.09      0.15       934
          21       0.00      0.00      0.00         1
          22       0.87      0.03      0.06       414
          23       0.48      0.65      0.55       517
          24       0.37      0.33      0.35       539
          25       0.00      0.00      0.00         1
          26       0.60      0.42      0.49      3891
          27       0.00      0.00      0.00         0
          28       0.82      0.08      0.15       676
          29       0.86      0.12      0.21       297
          30       0.80      0.40      0.53      1714
          31       0.00      0.00      0.00         4
          32       0.56      0.65      0.60      3398
          33       0.00      0.00      0.00         0

   micro avg       0.76      0.70      0.72     47851
   macro avg       0.39      0.27      0.29     47851
weighted avg       0.75      0.70      0.71     47851
 samples avg       0.74      0.76      0.72     47851
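
In the report above, the macro-averaged F1 (0.29) is far below the micro-averaged F1 (0.72). Macro averaging takes the plain mean of the per-label F1 scores, so the many labels with little or no support that score 0.00 pull it down. Micro averaging pools the counts over all labels first, so the frequent labels dominate. A worked example with made-up counts:

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one label; taken as 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up (tp, fp, fn) counts: one frequent label and two rare labels
# that the model never predicts.
counts = {"frequent": (90, 10, 10), "rare_a": (0, 0, 4), "rare_b": (0, 0, 1)}

# Macro: average the per-label F1 scores, each label weighing the same.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts over all labels, then compute a single F1.
total_tp, total_fp, total_fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(round(macro_f1, 3), round(micro_f1, 3))
```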



C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
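
The warnings above come from labels that are never predicted or never occur in the validation set, where precision or recall would divide by zero. As the message suggests, the `zero_division` parameter of `classification_report` sets the value to use in that case and silences the warning. The sketch below also passes `target_names` so the rows show category names instead of column indices. The two-label data is made up.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up multi-hot labels: the second label is never predicted.
y_true = np.array([[1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 0]])

report = classification_report(
    y_true,
    y_pred,
    target_names=["astro-ph", "gr-qc"],
    zero_division=0,    # use 0.0 where a metric is undefined, without warning
    output_dict=True,   # return a nested dict instead of a formatted string
)
print(report["astro-ph"])
print(report["gr-qc"])
```

For the model in this post, the call would presumably look like `classification_report(y_test, clf.predict(x_test), target_names=mlb.classes_, zero_division=0)`.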