利用 Keras 下的 LSTM 进行情感分析

最新推荐文章于 2025-09-30 16:12:52 发布

原创

最新推荐文章于 2025-09-30 16:12:52 发布 · 4.4w 阅读

104 ·

CC 4.0 BY-SA版权

文章标签：

#深度学习 #Keras #LSTM #RNN #情感分析

本文介绍如何利用Keras构建和训练一个基于LSTM的情感分析模型。通过Kaggle情感分类数据集，经过数据预处理、模型构建、训练及预测，最终在测试集上达到99%的准确率。

前言

$~~~~~~~$ 我们用 Keras 提供的 LSTM 层构造和训练一个 many-to-one 的 RNN。网络的输入是一句话，输出是一个情感值（积极或消极）。所用数据来自 Kaggle 的情感分类比赛（https://inclass.kaggle.com/c/si650winter11）。该训练数据长这样：
1 $~~~$ I either LOVE Brokeback Mountain or think it’s great that homosexuality is becoming more acceptable!:
1 $~~~$ Anyway, thats why I love ” Brokeback Mountain.
1 $~~~$ Brokeback mountain was beautiful…
0 $~~~$ da vinci code was a terrible movie.
0 $~~~$ Then again, the Da Vinci code is super shitty movie, and it made like 700 million.
0 $~~~$ The Da Vinci Code comes out tomorrow, which sucks.
其中的每个句子都有个标签 1 或 0，用来代表积极或消极。(下载数据)

$~~~~$ 先把用到的包一次性全部导入

from keras.layers.core import Activation, Dense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import nltk  #用来分词
import collections  #用来统计词频
import numpy as np

数据准备

$~~~~~$ 在开始前，先对所用数据做个初步探索。特别地，我们需要知道数据中有多少个不同的单词，每句话由多少个单词组成。

maxlen = 0  #句子最大长度
word_freqs = collections.Counter()  #词频
num_recs = 0 # 样本数
with open('./train.txt','r+') as f:
    for line in f:
        label, sentence = line.strip().split("\t")
        words = nltk.word_tokenize(sentence.lower())
        if len(words) > maxlen:
            maxlen = len(words)
        for word in words:
            word_freqs[word] += 1
        num_recs += 1
print('max_len ',maxlen)
print('nb_words ', len(word_freqs))

$~~~~$ max_len 42
$~~~~$ nb_words 2324

$~~~~~$ 可见一共有 2324 个不同的单词，包括标点符号。每句话最多包含 42 个单词。
$~~~~~$ 根据不同单词的个数 (nb_words)，我们可以把词汇表的大小设为一个定值，并且对于不在词汇表里的单词，把它们用伪单词 UNK 代替。根据句子的最大长度 (max_lens)，我们可以统一句子的长度，把短句用 0 填充。
$~~~~~$ 依前所述，我们把 VOCABULARY_SIZE 设为 2002。包含训练数据中按词频从大到小排序后的前 2000 个单词，外加一个伪单词 UNK 和填充单词 0。最大句子长度 MAX_SENTENCE_LENGTH 设为40。