NLP Basic Task 2: Text Classification with Deep Learning

This post follows the task list by Professor Qiu Xipeng of the School of Computer Science and Technology, Fudan University: https://www.zhihu.com/question/324189960

Task: get familiar with PyTorch; use PyTorch to rewrite Task 1, implementing CNN- and RNN-based text classification.

  1. References

    1. https://pytorch.org/
    2. Convolutional Neural Networks for Sentence Classification https://arxiv.org/abs/1408.5882
    3. https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
  2. Initialization via word embeddings

  3. Random embedding initialization

  4. Initialization from pretrained GloVe embeddings https://nlp.stanford.edu/projects/glove/ (a loading sketch follows the LSTM training output below)

  5. Key concepts:

    1. Feature extraction with CNN/RNN
    2. Word embeddings
    3. Dropout

Code:

Note: the code does not strictly follow the requirements above; in particular, it is written with Keras rather than PyTorch.
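Since the task asks for PyTorch while the code below uses Keras, here is a minimal PyTorch sketch of an equivalent LSTM classifier. It is only illustrative, not the code actually run below: it assumes the inputs are the padded word-index sequences built later, and its hyperparameters simply mirror the Keras model.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)  # index 0 is the pad token
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                       # x: (batch, seq_len) word indices
        emb = self.embedding(x)                 # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)            # h_n: (1, batch, hidden_dim)
        return self.fc(self.dropout(h_n[-1]))   # logits: (batch, num_classes)

Training would pair this with nn.CrossEntropyLoss() and torch.optim.Adam, the PyTorch counterparts of the sparse_categorical_crossentropy loss and adam optimizer used in the Keras models below.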

import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pylab as plt
%matplotlib inline

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"   # select which GPU to use

# Load the data; note delimiter='\t' since the files are tab-separated
df_train = pd.read_csv(r'sentiment-analysis-on-movie-reviews/train.tsv',delimiter='\t')
df_test = pd.read_csv(r'sentiment-analysis-on-movie-reviews/test.tsv',delimiter='\t')
df_train.head()

# Build the vectorizer. CountVectorizer is a standard feature extractor that turns text into token counts.
word_vectorizer = CountVectorizer(ngram_range = (1,1),analyzer = 'word',stop_words = 'english',min_df = 0.001)  # use the built-in English stop-word list
sparse_matrix = word_vectorizer.fit_transform(df_train['Phrase'])  # encode the corpus into a sparse document-term matrix

# print(sparse_matrix)
# Sample output:
# (0, 480)	1
#   (0, 352)	1
#   (0, 222)	2
#   (0, 451)	1
#   (1, 222)	1
#   (1, 451)	1
#   (2, 451)	1
# print(sum(sparse_matrix))
# Summing all rows gives per-word totals, e.g.:
# (0, 570)	161
# (0, 28)	213

# Count how often each word occurs in the corpus
# print(sparse_matrix.shape)  # (156060, 587)
frequency = sum(sparse_matrix).toarray()[0]  # toarray() yields [[ 179  204  176 ... ]], hence the [0]
# print(len(frequency))  # 587
# print(frequency)
freq = pd.DataFrame(frequency,index = word_vectorizer.get_feature_names(),columns = ['frequency'])  # get_feature_names_out() in sklearn >= 1.0
freq.sort_values('frequency',ascending = False)

# Inspect the label distribution; it is roughly bell-shaped. If a distribution is badly skewed, a transform such as log can be tried to bring it closer to normal.
a = df_train.Sentiment.value_counts()  # count how many samples each class has
# a.plot(kind = 'bar')  # this plot sorts the bars by count by default
# print(a.index)
# print(a.values)
plt.bar(a.index,a.values)

# A prettier version of the plot (requires: import seaborn as sns)
# a = pd.DataFrame(a)
# a['Rating'] = a.index
# sns.set_style("darkgrid", {"axes.facecolor": ".9"})
# fig, ax = plt.subplots(figsize=(10,6))
# sns.barplot(y='Sentiment', x='Rating', data=a)

# Preprocess the text: lowercase it and strip everything that is not alphanumeric or whitespace
import re
df_train['Phrase'] = df_train['Phrase'].str.lower()
df_train['Phrase'] = df_train['Phrase'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]','',x))
df_test['Phrase'] = df_test['Phrase'].str.lower()
df_test['Phrase'] = df_test['Phrase'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]','',x))  # \s must be kept here too, or all spaces in the test set are stripped
# print(df_train['Phrase'])

X_train = df_train.Phrase
y_train = df_train.Sentiment

# Build the vocabulary and the training data
from keras.preprocessing.text import Tokenizer  # see https://blog.csdn.net/lovebyz/article/details/77712003
tokenizer = Tokenizer()
# print(X_train)
# Sample output:
# 0         a series of escapades demonstrating the adage ...
# 1         a series of escapades demonstrating the adage ...
# 2                                                  a series
# 3                                                         a
# 4                                                    series
tokenizer.fit_on_texts(X_train.values)  # build the token dictionary from the documents (one element per document)

X_train = tokenizer.texts_to_sequences(X_train)  # convert each document into a sequence of word indices
# print(len(X_train))  #156060
# print(X_train[0])  #[2, 304, 3, 15110, 5906, 1, 6499, 9, 51, 8, 49, 13, 1, 3514, 8, 167, 49, 13, 1, 11381, 62, 3, 75, 615, 10453, 19, 576, 3, 75, 2003, 5, 54, 3, 2, 40]
# print(len(X_train[0]))  #35
# print(len(X_train[1]))  #14
# print(len(X_train[2]))  #2
X_test = df_test.Phrase
X_test = tokenizer.texts_to_sequences(X_test)

# Pad every sequence to a common length, here the maximum length in the training set
from keras.preprocessing.sequence import pad_sequences
max_length = max([len(x.split()) for x in df_train['Phrase']])
# print(max_length)  #48
X_train = pad_sequences(X_train,max_length)
X_test = pad_sequences(X_test,max_length)
# print(X_train.shape)  #(156060, 48)
# print(X_test.shape)  #(66292, 48)
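These padded arrays feed the Keras models below as-is. If one instead wanted to drive the PyTorch LSTMClassifier sketched earlier, wrapping them in a DataLoader takes a few lines; a minimal sketch, assuming X_train and y_train as built above:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Wrap the padded index matrix and the integer labels as tensors.
train_ds = TensorDataset(torch.as_tensor(X_train, dtype=torch.long),
                         torch.as_tensor(y_train.values, dtype=torch.long))
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)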

# Build the deep learning model (LSTM)
from keras import Sequential
from keras.layers import Embedding,LSTM,Dense

EMBEDDING_DIM = 128
dict_len = len(tokenizer.word_index) + 1
model = Sequential()
model.add(Embedding(dict_len,EMBEDDING_DIM,input_length = max_length))  # parameter details: https://blog.csdn.net/jiangpeng59/article/details/77533309
model.add(LSTM(units = 128,dropout = 0.2,recurrent_dropout = 0.2))  # dropout acts on the input-to-hidden connections, recurrent_dropout on the hidden-to-hidden connections
model.add(Dense(5,activation = 'softmax'))
model.compile(loss = 'sparse_categorical_crossentropy',optimizer= 'adam',metrics= ['accuracy'])
# print(model.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# embedding_4 (Embedding)      (None, 48, 128)           2099712   
# _________________________________________________________________
# lstm_3 (LSTM)                (None, 128)               131584    
# _________________________________________________________________
# dense_3 (Dense)              (None, 5)                 645       
# =================================================================
# Total params: 2,231,941
# Trainable params: 2,231,941
# Non-trainable params: 0
# _________________________________________________________________
# None

model.fit(X_train,y_train,batch_size= 128,epochs= 7,verbose= 1)
# Epoch 6/7
# 156060/156060 [==============================] - 101s 650us/step - loss: 0.5748 - acc: 0.7544
# Epoch 7/7
# 156060/156060 [==============================] - 101s 644us/step - loss: 0.5448 - acc: 0.7645
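The requirement list asks for GloVe-initialized embeddings, which the model above skips (its embedding layer is randomly initialized and learned from scratch). A minimal sketch, assuming glove.6B.100d.txt has been downloaded from the GloVe site and that tokenizer, dict_len, and max_length are the objects defined above:

import numpy as np

GLOVE_DIM = 100  # must match the dimensionality of the GloVe file used
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Row i of the matrix holds the vector for the Tokenizer's word index i (row 0 is padding).
embedding_matrix = np.zeros((dict_len, GLOVE_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# To use it, replace the Embedding layer above with:
# model.add(Embedding(dict_len, GLOVE_DIM, input_length=max_length,
#                     weights=[embedding_matrix], trainable=False))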

# Predict on the test set and write the submission file
y_test_pred = model.predict_classes(X_test)  # in newer Keras: np.argmax(model.predict(X_test), axis=1)
final_pred = pd.read_csv(r'sentiment-analysis-on-movie-reviews/sampleSubmission.csv', sep=',')
final_pred.Sentiment = y_test_pred  # overwrite the placeholder column with the predictions
final_pred.to_csv(r'results.csv', sep=',', index=False)

# The CNN model
from keras.layers import Conv1D,Dropout,MaxPooling1D,Flatten
def build_model():
    model = Sequential()
    model.add(Embedding(dict_len,output_dim=32,input_length = max_length))  
    model.add(Conv1D(filters = 32,kernel_size = 3,padding='same',activation='relu'))  
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(5,activation = 'softmax'))
    model.compile(loss = 'sparse_categorical_crossentropy',optimizer= 'adam',metrics= ['accuracy'])
    model.fit(X_train,y_train,batch_size= 128,epochs= 7,verbose= 1)
    return model

model2 = build_model()
# Epoch 6/7
# 156060/156060 [==============================] - 7s 45us/step - loss: 0.6345 - acc: 0.7340
# Epoch 7/7
# 156060/156060 [==============================] - 7s 43us/step - loss: 0.6068 - acc: 0.7462
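For comparison with the single-kernel Conv1D model above, reference 2 (Kim, 2014) runs several kernel sizes in parallel and concatenates their max-pooled features. A minimal PyTorch sketch of that architecture; the kernel sizes (3, 4, 5), filter count, and dropout rate are the paper's common defaults, not values taken from the code above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                        # x: (batch, seq_len)
        emb = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # One convolution per kernel size, each max-pooled over time.
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))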

 
