[TensorFlow2] Sentiment Classification of English Text with an LSTM

This post walks through building a sentiment-analysis model end to end, from data preprocessing and feature extraction to model training, covering the concrete steps with pandas, BeautifulSoup, nltk, gensim and tensorflow. After stripping HTML tags and stopwords and vectorizing the words with Word2Vec, an LSTM-based deep learning model is built that automatically judges the sentiment of a piece of text.

The data

There is nothing special about the data: it is just pairs of a paragraph and a label indicating that paragraph's sentiment. Read in with pandas, it looks like this:

id      sentiment  review
5814_8  1          With all this stuff going down at the moment w…
381_9   1          "The Classic War of the Worlds" by Timothy Hin…
7759_3  0          The film starts with a manager (Nicholas Bell)…
3630_4  0          It must be assumed that those who praised this…
9495_8  1          Superbly trashy and wondrously unpretentious 8…

Here, a sentiment of 1 means the corresponding review paragraph is positive. Note that the data was scraped from the web, so it inevitably carries some HTML tags; for example, the excerpt below contains br tags, which is why the data needs further cleaning.

…feeling towards the press and also the obvious message of drugs are bad m’kay.<br /><br />Visually impressive but of course this is all about Michael Jackson…

Libraries used

Besides tensorflow, numpy, pandas and sklearn, a few other libraries are needed:

BeautifulSoup

BeautifulSoup is a library for extracting data from HTML or XML files. With it, the tags picked up during scraping can easily be filtered out, leaving only the clean text.
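For example, the br tags in the excerpt quoted earlier can be stripped like this (a minimal sketch; the raw string here is just an illustration):

from bs4 import BeautifulSoup

# A hypothetical scraped snippet containing <br /> tags
raw="Visually impressive<br /><br />but of course this is all about Michael Jackson"
# get_text() drops the tags; passing a separator keeps a space where a tag used to be
print(BeautifulSoup(raw,'html.parser').get_text(separator=' '))
# Visually impressive but of course this is all about Michael Jackson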

nltk

nltk is a powerful natural language processing toolkit, but this post only uses its stopword list. Strictly speaking it is not required: downloading a separate stopword file (stopwords.txt) and loading it into the program yourself works just as well.

Installing this library has a few pitfalls, though. For example, after pip install nltk, running

import nltk
nltk.download()

may fail with an error about losing the connection to the remote host; the specific fix is described here.
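If the interactive downloader keeps failing, one workaround (a sketch; it assumes the default nltk data path is writable in your environment) is to download only the stopwords corpus directly:

import nltk

# Fetch only the stopwords corpus instead of opening the full interactive downloader
nltk.download('stopwords',quiet=True)

from nltk.corpus import stopwords
print(len(stopwords.words('english'))) # number of English stopwords available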

gensim

In this post, gensim is used to turn words into word vectors. For instance, if the vector length is set to 200, then words such as apple and banana are each converted into a vector of length 200, which can then be used for word embedding in the subsequent Embedding layer. However, the word-to-vector mapping produced by gensim's Word2Vec is not enough on its own: we also need a mapping from each word to its index in the model, which turns a whole sentence into a list of indices, as illustrated below:
[Figure: a sentence converted into its corresponding list of word indices]
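Conceptually, that second mapping works like the toy sketch below (the indices here are made up, not taken from the real vocabulary):

# Hypothetical word-to-index mapping; the real one is built from the Word2Vec vocabulary later on
word_index={'i':4,'like':3,'apple':1,'banana':2}

sentence='i like apple'
# Unknown words fall back to index 0
indices=[word_index.get(word,0) for word in sentence.split()]
print(indices) # [4, 3, 1]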

Implementation details

Training

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

The snippet above silences the warnings that appear while the script runs, such as:

D:\Anaconda\lib\site-packages\bs4\__init__.py:273: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
D:\Anaconda\lib\site-packages\bs4\__init__.py:273: UserWarning: "b'...'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
D:\Anaconda\lib\site-packages\bs4\__init__.py:336: UserWarning: "http://www.happierabroad.com" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
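Silencing every warning is a rather blunt instrument; a narrower alternative (a sketch, assuming the noise all comes from bs4 as in the messages above) is to filter only that module:

import warnings

# Ignore only the UserWarnings raised inside the bs4 package, keep everything else visible
warnings.filterwarnings('ignore',category=UserWarning,module='bs4')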

First, read in the data:

import pandas as pd
df=pd.read_csv('../data/labeledTrainData.tsv',sep='\t',escapechar='\\')

import nltk
from nltk.corpus import stopwords
# Load the English stopword list, used to drop words that carry no sentiment, such as a, an, etc.
eng_stopwords=set(stopwords.words('english'))

Define a function to clean up the scraped text:

import re
from bs4 import BeautifulSoup

def clean_text(text):
    # Strip the HTML tags and keep only the actual text
    text=BeautifulSoup(text,'html.parser').get_text()
    # Keep letters only, replacing punctuation and digits with spaces
    text=re.sub(r'[^a-zA-Z]',' ',text)
    # Lowercase, split into words and drop the stopwords
    text=text.lower().split()
    text=[word for word in text if word not in eng_stopwords]
    return ' '.join(text)

cleaned_text=df.review.apply(clean_text)

sentence_list=[]
for line in cleaned_text:
    # Split each cleaned review into a list of individual words
    sentence_list.append(line.split())

At this point, every element stored in sentence_list is a list of words:
[Figure: sentence_list, where each element is a list of words]
Now the Word2Vec model can be built from it.

import os
from gensim.models.word2vec import Word2Vec

feature_nums=250 # length of each word vector
min_count=10     # words occurring fewer times than this are dropped from the vocabulary
worker_num=4     # number of parallel workers used for training
window=10        # maximum distance between the current and the predicted word within a sentence

# Note: this is the gensim 3.x API (size=, init_sims); in gensim 4+ the argument is vector_size
model=Word2Vec(sentence_list,workers=worker_num,size=feature_nums,min_count=min_count,window=window)

# If the model will not be trained any further, init_sims(replace=True) saves memory
model.init_sims(replace=True)
# Save the trained model
model_name='{}features_{}mincount_{}window.model'.format(feature_nums,min_count,window)
model.save(os.path.join('..','models',model_name))
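The saved model can later be loaded back without retraining (a minimal sketch, reusing model_name from above):

import os
from gensim.models.word2vec import Word2Vec

# Reload the trained Word2Vec model from disk
model=Word2Vec.load(os.path.join('..','models',model_name))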

Let's test the trained model:
[Figure: the word vector produced for 'hello']
As you can see, hello has been converted into a word vector of length 250.
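That check corresponds to something like the following (a sketch; the actual numbers depend on your trained model):

# Look up the learned vector for a single word
vec=model.wv['hello']
print(vec.shape) # (250,), matching feature_nums

# Nearest neighbours of 'hello' in the embedding space
print(model.wv.most_similar('hello',topn=5))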

But the model alone is not enough: we also need the mapping from each word to its index, and from each index to its word vector. Converting words to vectors by hand would be slow; once we have these two mappings, everything can simply be handed to the Embedding layer, which is also much faster than converting manually.

import numpy as np

# Collect every word in the model's vocabulary
vocab_list=[word for word,Vocab in model.wv.vocab.items()]

word_index={' ':0} # word -> index mapping
embeddings_matrix=np.zeros((len(vocab_list)+1,model.vector_size)) # index -> word vector mapping, i.e. simply a matrix

# Build both mappings; index 0 is kept free for the all-zero vector that represents out-of-vocabulary words
for i in range(len(vocab_list)):
    word=vocab_list[i]
    word_index[word]=i+1
    embeddings_matrix[i+1]=model.wv[word]

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Convert each sentence into a list of word indices
def tokenizer(texts,word_index):
    tokenized=[]
    for sentence in texts:
        new_txt=[]
        for word in sentence.split(): # split the sentence into words
            try:
                new_txt.append(word_index[word])
            except KeyError:
                # unknown words are mapped to index 0
                new_txt.append(0)
        tokenized.append(new_txt)
    # Pad every index list to the same length: shorter sentences are zero-padded, longer ones truncated;
    # maxlen can be chosen from the length statistics of the training data
    tokenized=pad_sequences(tokenized,maxlen=300,padding='pre',truncating='pre') # pre-padding and pre-truncation
    return tokenized

y=df.sentiment.values
x=df.review.values

# Clean every review; note that filtering rows here would throw x and y out of sync
x=[clean_text(sentence) for sentence in x]
x=tokenizer(x,word_index)
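A quick sanity check on the result (a sketch; the exact row count depends on the dataset):

# x should now be an integer matrix with one 300-long row per review,
# and y the matching vector of 0/1 labels
print(x.shape,y.shape)
print(x[0][:20]) # first 20 indices of the first review (zeros, if any, are padding or unknown words)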

After stopword removal and index conversion, a sentence ends up looking like this:
[Figure: a review converted into a padded list of word indices]
With that, training can begin:

import tensorflow as tf
from tensorflow.keras.layers import LSTM,Dense,Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Sequential
import numpy as np
import pandas as pd
from tensorflow.keras.layers import Embedding,Bidirectional
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2) # split into training and test sets

training_model=Sequential([
    Embedding(input_dim=len(embeddings_matrix),output_dim=250,weights=[embeddings_matrix],input_length=300,trainable=False), # load the prebuilt embedding matrix and keep it frozen
    Bidirectional(LSTM(units=32,return_sequences=True)),
    LSTM(units=16),
    Dense(1,activation='sigmoid')
])
training_model.compile(loss='binary_crossentropy',optimizer=Adam(1e-3),metrics=['accuracy'])
training_model.fit(x_train,y_train,validation_split=0.1,epochs=10,batch_size=64,shuffle=True)
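The layer structure shown below can be printed with the standard Keras summary call; saving the trained classifier is optional, and the lstm_sentiment.h5 filename here is just an example:

import os

# Print layer names, output shapes and parameter counts
training_model.summary()

# Optionally persist the trained classifier next to the Word2Vec model
training_model.save(os.path.join('..','models','lstm_sentiment.h5'))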

The model structure:
[Figure: model structure printed layer by layer]
Training progress:
[Figure: training log showing loss and accuracy per epoch]

Testing

The model's performance on the test set:
[Figure: evaluation result on the test set]
As you can see, the accuracy on the test set is 88%.
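A number like that comes from evaluating the model on the held-out split, roughly like this (a sketch):

# Evaluate loss and accuracy on the 20% test split
test_loss,test_acc=training_model.evaluate(x_test,y_test,batch_size=64)
print('test accuracy: {:.2%}'.format(test_acc))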

Let's feed a few passages into it and see how it does:

sample_reviews=['I will not waste my time on it.',
'Just fine I think, maybe the backgroud music does a little harm to the whole.',
"I had great expectations as I had just finished season 2 of Das Boot. I was disapointed. It fails to even come close to the suspense and the horror of the battles in the Atlantic during WW2. Watch Das Boot instead.",
"Having served in the Cold War on both a destroyer and on a submarine, I found this story contrasting the tensions between both worlds. The action is shown from the bridge, CIC, and decks of the Greyhound, which BTW is the slang term for destroyers and those who serve on them. Hanks subtly conveys the ache of leaving a loved one behind and her presence with him during the battle. He and his crew feel the presence of the subs stalking the surface ships and the deaths of sailors both above and below the icy water. There is no perfect rendering of combat in film, but the repeated commands and protocols between naval personnel and vessels are accurate enough to convey a sense of proper urgency to the story. Compressing roughly 48 tense hours into a ~2 hour film doesn't give much time to absorb all that's happening, and that's the point. Training and subsequent reactions shape the story in the faces of the bridge crew as they watch the captain and follow his orders which he does not explain. This is about relationships between combatants, among the ships in the convoy, and between U.S. and British allies. This film, The Enemy Below, and Das Boot make a reasonable trilogy for a weekend marathon. Enjoy this story from either a technical or a relational view as you see fit."]

# Clean and tokenize the sample sentences exactly like the training data
sample_reviews=[clean_text(i) for i in sample_reviews]
sample_reviews=tokenizer(sample_reviews,word_index)

training_model.predict(sample_reviews)

The output is:

array([[0.08535763],
[0.49167582],
[0.40483615],
[0.6165438 ]], dtype=float32)
Values closer to 1 indicate a more positive sentiment, values closer to 0 a more negative one.
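To turn these probabilities into hard labels, simply threshold them at 0.5 (a sketch, reusing the tokenized sample_reviews from above):

probs=training_model.predict(sample_reviews)
# Anything above 0.5 counts as a positive review
labels=(probs>0.5).astype(int).ravel()
print(labels) # [0 0 0 1] for the four outputs shown above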

The model's predictions on outside data are still not great; after all, the training set is fairly small, so the model simply hasn't seen much of the world...

Code and data
Extraction code: yu19
