Task 4: Paper Category Classification

Task Description

  • Learning topic: paper classification (a data modeling task): build a model from the existing data and use it to assign categories to new papers;
  • Learning content: classify papers by category using their titles;
  • Learning outcome: learn the basic methods of text classification, such as TF-IDF.

Data Processing Steps

In the raw arXiv data every paper already has categories, which are filled in by the authors. In this task we use the paper title and abstract to perform the classification:

  • process the paper titles and abstracts;
  • process the paper categories;
  • build a text classification model.

Text Classification Approaches

  • Approach 1: TF-IDF + machine learning classifier

Extract features from the text directly with TF-IDF and feed them to a classifier. For the classifier you can use SVM, logistic regression (LR), XGBoost, and so on.

  • Approach 2: FastText

FastText is the entry-level word-vector model; with Facebook's FastText tool you can quickly build a classifier (see the sketch after this list).

  • Approach 3: Word2Vec + deep learning classifier

Word2Vec is a more advanced word-vector model, combined with a deep learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.

  • Approach 4: BERT word vectors

BERT is the high-end word-vector option, with strong modeling capacity (a sketch also follows this list).
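
FastText is not demonstrated later in this notebook, so here is a minimal, hedged sketch of what Approach 2 could look like with the fasttext Python package. It assumes the package is installed and that the data['text'] and data['categories_big'] columns built further below are available; for simplicity only the first category of each paper is kept, even though the data is multi-label:

import fasttext

# Write the training data in FastText's "__label__<category> <text>" format,
# keeping only the first top-level category of each paper for illustration
with open('fasttext_train.txt', 'w') as f:
    for text, cats in zip(data['text'], data['categories_big']):
        f.write('__label__{} {}\n'.format(cats[0], text))

# Train a supervised classifier; the hyperparameters are illustrative, not tuned
model_ft = fasttext.train_supervised('fasttext_train.txt', epoch=5, wordNgrams=2)
print(model_ft.predict('calculation of prompt diphoton production cross sections'))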
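
Approach 4 is likewise only mentioned here. A minimal sketch of a multi-label fine-tuning setup with the Hugging Face transformers library might look like the following; the model name, the problem_type argument, and the use of data['text'] built below are assumptions, and the snippet only shows a forward pass, not a full training loop:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT with a 38-way multi-label classification head (one sigmoid per label)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=38,
    problem_type='multi_label_classification')

# Tokenize a small batch of paper texts and run one forward pass
inputs = tokenizer(list(data['text'].iloc[:8]), truncation=True, padding=True, return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # expected: torch.Size([8, 38])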

Code Implementation and Walkthrough

To make it easier to get started with text classification, we walk through Approach 1 and Approach 2. First, read in the fields we need:

# Import the required packages (not all of them are used in this task)
import seaborn as sns            # plotting
from bs4 import BeautifulSoup    # used for scraping the arXiv data
import re                        # regular expressions, for pattern matching in strings
import requests                  # network requests
import json                      # our data is stored in JSON format
import pandas as pd              # data processing and analysis
import matplotlib.pyplot as plt  # plotting
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file line by line and return a DataFrame.
        path: path to the file
        columns: columns to keep
        count: number of lines to read (None reads the whole file)
    '''

    data = []
    with open(path, 'r') as f: 
        for idx, line in enumerate(f): 
            if idx == count:
                break
                
            d = json.loads(line)
            d = {col : d[col] for col in columns}
            data.append(d)

    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-snapshot.json', 
                     ['id', 'title', 'categories', 'abstract'])

data
index | id | title | categories | abstract
0 | 0704.0001 | Calculation of prompt diphoton production cros... | hep-ph | A fully differential calculation in perturba...
1 | 0704.0002 | Sparsity-certifying Graph Decompositions | math.CO cs.CG | We describe a new algorithm, the $(k,\ell)$-...
2 | 0704.0003 | The evolution of the Earth-Moon system based o... | physics.gen-ph | The evolution of Earth-Moon system is descri...
3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | math.CO | We show that a determinant of Stirling cycle...
4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | math.CA math.FA | In this paper we show how to compute the $\L...
... | ... | ... | ... | ...
1796906 | supr-con/9608008 | On the origin of the irreversibility line in t... | supr-con cond-mat.supr-con | We report on measurements of the angular dep...
1796907 | supr-con/9609001 | Nonlinear Response of HTSC Thin Film Microwave... | supr-con cond-mat.supr-con | The non-linear microwave surface impedance o...
1796908 | supr-con/9609002 | Critical State Flux Penetration and Linear Mic... | supr-con cond-mat.supr-con | The vortex contribution to the dc field (H) ...
1796909 | supr-con/9609003 | Density of States and NMR Relaxation Rate in A... | supr-con cond-mat.supr-con | We show that the density of states in an ani...
1796910 | supr-con/9609004 | Ginzburg Landau theory for d-wave pairing and ... | supr-con cond-mat.supr-con | The Ginzburg Landau theory for d_{x^2-y^2}-w...

1796911 rows × 4 columns

To simplify the data handling, we concatenate the title and the abstract and classify the combined text.

data['text'] = data['title'] + data['abstract']

data['text'] = data['text'].apply(lambda x: x.replace('\n', ' '))  # replace newlines with spaces
data['text'] = data['text'].apply(lambda x: x.lower())             # lowercase everything
data = data.drop(['abstract', 'title'], axis=1)                    # drop the columns we no longer need
data
index | id | categories | text
0 | 0704.0001 | hep-ph | calculation of prompt diphoton production cros...
1 | 0704.0002 | math.CO cs.CG | sparsity-certifying graph decompositions we d...
2 | 0704.0003 | physics.gen-ph | the evolution of the earth-moon system based o...
3 | 0704.0004 | math.CO | a determinant of stirling cycle numbers counts...
4 | 0704.0005 | math.CA math.FA | from dyadic $\lambda_{\alpha}$ to $\lambda_{\a...
... | ... | ... | ...
1796906 | supr-con/9608008 | supr-con cond-mat.supr-con | on the origin of the irreversibility line in t...
1796907 | supr-con/9609001 | supr-con cond-mat.supr-con | nonlinear response of htsc thin film microwave...
1796908 | supr-con/9609002 | supr-con cond-mat.supr-con | critical state flux penetration and linear mic...
1796909 | supr-con/9609003 | supr-con cond-mat.supr-con | density of states and nmr relaxation rate in a...
1796910 | supr-con/9609004 | supr-con cond-mat.supr-con | ginzburg landau theory for d-wave pairing and ...

1796911 rows × 3 columns

Since a paper may belong to several categories, the category field also needs processing:

# Multiple categories, including sub-categories
data['categories'] = data['categories'].apply(lambda x: x.split(' '))  # list of categories for each paper

# Top-level categories only, without sub-categories
data['categories_big'] = data['categories'].apply(lambda x: [xx.split('.')[0] for xx in x])  # list of top-level categories for each paper

Next, encode the categories. Because each paper can carry several labels, a multi-label encoding is required:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])

data_label
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])
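
(Optional, not part of the original walkthrough.) The i-th column of data_label corresponds to mlb.classes_[i], so the fitted MultiLabelBinarizer can be used to look up which top-level category each column stands for:

print(len(mlb.classes_))    # number of distinct top-level categories (38 here)
print(mlb.classes_[:10])    # the first few category names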

Approach 1

Approach 1 uses TF-IDF to extract features:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
data_tfidf
<1796911x598592 sparse matrix of type '<class 'numpy.float64'>'
	with 151272947 stored elements in Compressed Sparse Row format>
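
With the default settings the vocabulary has almost 600,000 terms. If memory is tight, TfidfVectorizer can be constrained; this is an optional variant with illustrative parameter values, not what was run above:

# Optional: cap the vocabulary to reduce memory usage (parameter values are illustrative)
vectorizer_small = TfidfVectorizer(max_features=50000, min_df=5, stop_words='english')
data_tfidf_small = vectorizer_small.fit_transform(data['text'])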

Since this is a multi-label classification problem, we can wrap a base classifier with sklearn's multi-label tools:

# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                     test_size=0.2, random_state=1)

# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)

from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        13
           1       0.00      0.00      0.00       106
           2       0.00      0.00      0.00       275
           3       0.00      0.00      0.00         5
           4       0.97      0.82      0.89     56225
           5       0.00      0.00      0.00        32
           6       0.00      0.00      0.00         2
           7       0.00      0.00      0.00       482
           8       0.00      0.00      0.00        66
           9       0.00      0.00      0.00       175
          10       0.00      0.00      0.00        40
          11       0.91      0.73      0.81     59797
          12       0.86      0.84      0.85     62959
          13       0.00      0.00      0.00       157
          14       0.00      0.00      0.00       623
          15       0.16      0.00      0.00      5696
          16       0.00      0.00      0.00        83
          17       0.87      0.09      0.16     16911
          18       0.93      0.10      0.18      8578
          19       0.26      0.00      0.00      4726
          20       0.82      0.52      0.64     30796
          21       0.91      0.34      0.50     27934
          22       0.90      0.81      0.85     95914
          23       0.02      0.00      0.00     12695
          24       0.00      0.00      0.00        63
          25       0.12      0.00      0.00      7035
          26       0.11      0.00      0.00      4207
          27       0.56      0.01      0.01      9741
          28       0.00      0.00      0.00       151
          29       0.82      0.14      0.25     36026
          30       0.00      0.00      0.00         8
          31       0.00      0.00      0.00       312
          32       0.85      0.04      0.08      6604
          33       0.14      0.00      0.00      2408
          34       0.96      0.13      0.23     21601
          35       0.00      0.00      0.00       304
          36       0.84      0.06      0.11     16406
          37       0.00      0.00      0.00        40

   micro avg       0.90      0.53      0.66    489196
   macro avg       0.32      0.12      0.15    489196
weighted avg       0.82      0.53      0.59    489196
 samples avg       0.69      0.62      0.64    489196
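
The report above uses MultinomialNB as the base learner. The other classifiers mentioned under Approach 1, such as logistic regression or a linear SVM, can be plugged in the same way; here is a hedged sketch with scikit-learn's OneVsRestClassifier, not run on this data and noticeably slower to train than Naive Bayes at this scale:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One independent binary linear SVM per label
clf_svm = OneVsRestClassifier(LinearSVC(), n_jobs=-1)
clf_svm.fit(x_train, y_train)
print(classification_report(y_test, clf_svm.predict(x_test)))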

Approach 2

Approach 2 uses a deep learning model: each word is mapped to an embedding and the network is trained on top of it. First encode the dataset and pad/truncate the sequences:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:100000],
                                                     data_label[:100000],
                                                     test_size=0.95, random_state=1)
# note: this split is not used below; the first 100,000 samples are encoded directly

# Parameters
max_features = 500
max_len = 150
embed_size = 100
batch_size = 128
epochs = 5

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

# Fit the tokenizer on the first 100,000 texts, keeping only the top max_features words
tokens = Tokenizer(num_words=max_features)
tokens.fit_on_texts(list(data['text'].iloc[:100000]))

y_train = data_label[:100000]
x_sub_train = tokens.texts_to_sequences(data['text'].iloc[:100000])
x_sub_train = sequence.pad_sequences(x_sub_train, maxlen=max_len)  # pad/truncate every sequence to max_len
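
(Optional check, not in the original.) After padding, every paper is represented by a fixed-length integer sequence:

print(x_sub_train.shape)        # (100000, 150): one sequence of length max_len per paper
print(len(tokens.word_index))   # full vocabulary seen by the tokenizer; only the top max_features words are used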

Define the model and train it:

# Bidirectional GRU + CNN model
# Keras layers:
from keras.layers import Dense, Input, LSTM, Bidirectional, Activation, Conv1D, GRU
from keras.layers import Dropout, Embedding, GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, concatenate, SpatialDropout1D
# Keras callback functions:
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras.models import Model
from keras.optimizers import Adam

sequence_input = Input(shape=(max_len, ))
x = Embedding(max_features, embed_size, trainable=True)(sequence_input)  # word embeddings learned from scratch
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
preds = Dense(38, activation="sigmoid")(x)  # one sigmoid output per label (multi-label classification)

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])
model.fit(x_sub_train, y_train, 
          batch_size=batch_size, 
          validation_split=0.2,
          epochs=epochs)
WARNING:tensorflow:From D:\softwares\DevTool\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Train on 80000 samples, validate on 20000 samples
Epoch 1/5
80000/80000 [==============================] - 570s 7ms/step - loss: 0.1266 - accuracy: 0.9677 - val_loss: 0.0764 - val_accuracy: 0.9757
Epoch 2/5
80000/80000 [==============================] - 622s 8ms/step - loss: 0.0694 - accuracy: 0.9768 - val_loss: 0.0636 - val_accuracy: 0.9786
Epoch 3/5
80000/80000 [==============================] - 621s 8ms/step - loss: 0.0609 - accuracy: 0.9789 - val_loss: 0.0576 - val_accuracy: 0.9800
Epoch 4/5
80000/80000 [==============================] - 620s 8ms/step - loss: 0.0572 - accuracy: 0.9799 - val_loss: 0.0557 - val_accuracy: 0.9805
Epoch 5/5
80000/80000 [==============================] - 618s 8ms/step - loss: 0.0553 - accuracy: 0.9804 - val_loss: 0.0541 - val_accuracy: 0.9808
<keras.callbacks.callbacks.History at 0x1b6b7ca1bc8>
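
The network ends in 38 sigmoid units, one per top-level category, so to turn predictions back into category names the probabilities can be thresholded and passed through the fitted MultiLabelBinarizer. A hedged sketch; the 0.5 cut-off is an assumption, not tuned here:

# Predict on a few padded sequences and map the probabilities back to category names
pred_prob = model.predict(x_sub_train[:5])    # shape (5, 38), one sigmoid probability per label
pred_bin = (pred_prob > 0.5).astype(int)      # assumed threshold of 0.5
print(mlb.inverse_transform(pred_bin))        # tuples of predicted top-level categories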