Text Sentiment Analysis: A Neural Network Model


Yue_kk


Category: NLP. Tags: neural networks, natural language processing, deep learning, python

Article link: https://blog.csdn.net/m0_46144891/article/details/118934203


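This article explains how to turn text into vectors by building a vocabulary and applying word embeddings, and then how to use a BiLSTM for sentiment analysis. It covers word vector representation, the Word2Vec model, the role of RNNs (LSTM in particular) in sentence understanding, and a worked example including model training.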


1. Method

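How word vectors represent text: every word in a sentence can be converted into a vector. A sentence of 16 words, for example, then becomes a 16 × D input matrix, where D is the dimension of each word vector.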

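(1) Building the vocabulary: store each word in the text together with its integer index in a dictionary, and implement methods that use the dictionary to map a sentence to a list of integers.

Basic steps:

1) Tokenize all sentences.

2) Count how often each word occurs, filter out words by frequency, and store the remaining words in the dictionary.

3) Implement a method that converts text into an integer sequence.

4) Implement a method that converts an integer sequence back into text.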

 # Corpus used to build the vocabulary (already tokenized)
 sentences = [["今天","天气","很","好"],["今天","去","吃","什么"]]
 # Vocab is a user-defined vocabulary class (not a library import)
 ws = Vocab()
 for sentence in sentences:
     # Accumulate word frequencies
     ws.fit(sentence)
 
 # Build the vocabulary, keeping words that occur at least min_count times
 ws.build_vocab(min_count=1)
 print(ws.dict)
 >>> {'<UNK>':1, '<PAD>':0, '今天':2, '天气':3, '很':4, '好':5, '去':6, '吃':7, '什么':8}
 
 # Convert a sentence to an integer sequence, padded to max_len
 ret = ws.transform(["好","好","好","好","好","好","好","热","呀"], max_len=13)
 print(ret)
 >>> [5,5,5,5,5,5,5,1,1,0,0,0,0]
 
 # Convert the integer sequence back to words
 ret = ws.inverse_transform(ret)
 print(ret)
 >>> ['好','好','好','好','好','好','好','<UNK>','<UNK>','<PAD>','<PAD>','<PAD>','<PAD>']
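
The usage above relies on a Vocab class whose implementation the post does not include. The following is a minimal sketch of what such a class might look like, assuming `<PAD>` is index 0, `<UNK>` is index 1, and padding is added on the right, as the printed outputs suggest. The attribute names `counts` and `inverse_dict` are made up for this illustration.

```python
from collections import Counter


class Vocab:
    """Minimal vocabulary: word <-> integer index, with <PAD>=0 and <UNK>=1."""

    PAD, UNK = "<PAD>", "<UNK>"

    def __init__(self):
        self.counts = Counter()                  # word -> frequency
        self.dict = {self.UNK: 1, self.PAD: 0}   # word -> index
        self.inverse_dict = {}                   # index -> word

    def fit(self, sentence):
        """Accumulate word counts from one tokenized sentence."""
        self.counts.update(sentence)

    def build_vocab(self, min_count=1):
        """Assign the next free index to every word seen at least min_count times."""
        for word, count in self.counts.items():  # first-seen order is preserved
            if count >= min_count and word not in self.dict:
                self.dict[word] = len(self.dict)
        self.inverse_dict = {index: word for word, index in self.dict.items()}

    def transform(self, sentence, max_len):
        """Map words to indices, then truncate or right-pad to max_len."""
        ids = [self.dict.get(word, self.dict[self.UNK]) for word in sentence]
        ids = ids[:max_len]
        return ids + [self.dict[self.PAD]] * (max_len - len(ids))

    def inverse_transform(self, ids):
        """Map indices back to words; unknown indices become <UNK>."""
        return [self.inverse_dict.get(i, self.UNK) for i in ids]
```

Unknown words such as "热" and "呀" fall back to the `<UNK>` index, so the round trip through transform and inverse_transform is lossy for them by design.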

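(2) Word vector representation (word embedding)

A model cannot compute on raw text directly, so the text must first be converted into vectors. Common options are one-hot encoding and word embedding; word embedding is used here.

Word embedding is a common way to represent text in deep learning. Unlike one-hot encoding, it represents tokens with a dense floating-point matrix. The vector dimension is usually chosen according to the vocabulary size, with typical values such as 100, 256, or 300. Every value in a vector is a parameter: it is initialized randomly and then learned during training. The learned vectors carry relationships between words, so the similarity between two vectors can be computed.

token -> num -> vector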
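
To make the similarity claim concrete, here is a small sketch of the token -> num -> vector pipeline followed by a cosine similarity comparison. The three-word vocabulary and the vector values are hand-picked for illustration and are not trained embeddings; in practice the table would be learned.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# token -> num: a tiny vocabulary (made up for this example)
word_to_index = {"good": 0, "great": 1, "terrible": 2}

# num -> vector: one row per index; values are hand-picked, not trained
embedding_table = [
    [0.9, 0.8, 0.1],    # good
    [0.8, 0.9, 0.2],    # great
    [-0.7, -0.9, 0.1],  # terrible
]


def embed(word):
    """Look up the dense vector for a word."""
    return embedding_table[word_to_index[word]]


sim_close = cosine_similarity(embed("good"), embed("great"))
sim_far = cosine_similarity(embed("good"), embed("terrible"))
print(sim_close > sim_far)  # True
```

With one-hot vectors the same comparison would be uninformative, since any two distinct one-hot vectors have a cosine similarity of 0.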

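2.1 Using the word embedding API: torch.nn.Embedding(num_embeddings, embedding_dim)

   import torch.nn as nn

   # Parameters:
   #   1. num_embeddings: size of the vocabulary
   #   2. embedding_dim: dimension of each embedding vector
   # vocab_size and input (a LongTensor of word indices) are assumed to be defined elsewhere
   embedding = nn.Embedding(vocab_size, 300)  # instantiate the layer
   input_embedded = embedding(input)  # look up the vector for each index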

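2.2 Using Word2Vec

Word2Vec represents words as high-dimensional vectors and places words with similar meanings close to each other. All it needs for training is a large corpus in the target language.

Suppose the input sentence is "I thought the movie was incredible and inspiring". To obtain its word vectors, we can use TensorFlow's embedding function embedding_lookup(). It takes two arguments: the embedding matrix (the word vector matrix) and the index of each word. The result contains one vector per word, which together form the vector representation of the sentence.
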
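
The lookup itself is just a row selection. The sketch below mimics that behaviour in plain Python rather than calling TensorFlow; the function name `embedding_lookup` is reused only for readability, and the vocabulary and random matrix are assumptions made for the example.

```python
import random

random.seed(0)

sentence = "I thought the movie was incredible and inspiring".split()

# Assumption: a toy vocabulary built from this one sentence.
word_to_index = {word: i for i, word in enumerate(sentence)}

embedding_dim = 4
# Embedding matrix: one row of embedding_dim floats per vocabulary entry.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(word_to_index))
]


def embedding_lookup(matrix, ids):
    """Return the rows of `matrix` selected by `ids`, in order."""
    return [matrix[i] for i in ids]


ids = [word_to_index[word] for word in sentence]
sentence_matrix = embedding_lookup(embedding_matrix, ids)
print(len(sentence_matrix), len(sentence_matrix[0]))  # 8 4
```

The eight-word sentence becomes an 8 × 4 matrix here; with real Word2Vec vectors the second dimension would be the trained vector size.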
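(3) Building the neural network and training the model

How RNNs are used: once Word2Vec has converted the words into high-dimensional vectors, a sentence corresponds to a collection of word vectors. An RNN encodes this high-dimensional sentence representation into a single lower-dimensional vector while retaining most of the useful information.

How an RNN computes: it reads the sentence one word at a time and updates its impression of what the sentence means after each word, so that by the end it has a rough understanding of the whole sentence. The defining feature of a recurrent neural network is that a neuron's output is fed back to itself as input at the next time step, which gives the model a short-term memory.
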
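
The recurrence described here can be sketched in a few lines. This is a simple (Elman-style) RNN step written in plain Python, with toy weights and word vectors chosen arbitrarily for illustration; it is not the network trained later in the post.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, W_x, W_h, b):
    """One time step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    wx = matvec(W_x, x)
    wh = matvec(W_h, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, wh, b)]


# Toy sizes: 3-dimensional word vectors, 2-dimensional hidden state.
W_x = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_h = [[0.1, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

# Three made-up word vectors standing in for a three-word sentence.
word_vectors = [[1.0, 0.0, 0.5], [0.2, 0.7, -0.1], [0.0, -0.4, 0.9]]

h = [0.0, 0.0]  # initial state: nothing read yet
for x in word_vectors:
    h = rnn_step(x, h, W_x, W_h, b)  # the previous output feeds the next step

print(len(h))  # 2
```

After the loop, `h` is the fixed-size vector that summarizes the whole sequence, regardless of how many words were read.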
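LSTM: adds memory and forgetting to the basic RNN to address long-term dependencies. Call the machine's understanding of a sentence its state. An input enters the state only if it passes the input gate; most words do not pass, and only a few key words get in. As more of the sentence is read, more words accumulate in the state. The words in the state cycle through the forget gate, and only those that pass it are retained, so the number of words held in the state eventually reaches an equilibrium. At output time there is an output gate, and only the words that pass it are emitted. These are the three gates of an LSTM:

1) Input gate: decides which words can enter the memory.

2) Forget gate: decides which words continue to be remembered.

3) Output gate: decides which words can be output.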
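
The three gates can be written out as one LSTM time step. The sketch below follows the standard LSTM equations in plain Python with randomly initialized toy weights (the parameter names such as `W_i` are chosen for this example). Note that in this formulation each gate is a vector of values between 0 and 1 that scales the components of the state, which is a softer mechanism than admitting or rejecting whole words.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h and cell state c."""
    z = h_prev + x  # concatenate previous hidden state and current input
    i = [sigmoid(a) for a in linear(params["W_i"], params["b_i"], z)]    # input gate
    f = [sigmoid(a) for a in linear(params["W_f"], params["b_f"], z)]    # forget gate
    o = [sigmoid(a) for a in linear(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in linear(params["W_g"], params["b_g"], z)]  # candidate values
    # Keep part of the old memory (forget gate) and admit part of the new (input gate).
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Expose part of the memory as output (output gate).
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


random.seed(0)
input_dim, hidden_dim = 3, 2
params = {}
for gate in "ifog":
    params["W_" + gate] = [
        [random.uniform(-0.5, 0.5) for _ in range(hidden_dim + input_dim)]
        for _ in range(hidden_dim)
    ]
    params["b_" + gate] = [0.0] * hidden_dim

h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
for x in [[1.0, 0.0, 0.5], [0.2, 0.7, -0.1]]:  # two made-up word vectors
    h, c = lstm_step(x, h, c, params)

print(len(h), len(c))  # 2 2
```

A bidirectional LSTM, as used in the code that follows in the post, runs one such recurrence left to right and another right to left and combines the two final states.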

2. Code (BiLSTM)

# -*- coding: utf-8 -*-
import jieba
import numpy as np
import pandas as pd

import multiprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary

import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import load_model
# Import every layer from keras.layers; the embeddings/recurrent/core submodule paths are gone in newer Keras releases
from keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM

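cpu_count = multiprocessing.cpu_count()  # number of Word2Vec worker threads
vocab_dim = 100  # dimension of each word vector
n_iterations = 1  # Word2Vec training epochs
n_exposures = 10  # minimum word frequency; rarer words are dropped
window_size = 7  # Word2Vec context window
n_epoch = 30  # BiLSTM training epochs
maxlen = 100  # sentences are padded or truncated to this many tokens
batch_size = 32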


def loadfile():
    neg = pd.read_csv('data/train_neg.csv', header=None, index_col=None)
    pos = pd.read_csv('data/train_pos.csv', header=None, index_col=None)

    combined = np.concatenate((pos[0],neg[0]))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))

    return combined, y

def tokenizer(data):
    text = [jieba.lcut(document.replace('\n', '')) for document in data]
    return text


def create_dictionaries(model=None, combined=None):
    """Return the word->index map, the word->vector map, and the padded index sequences."""
    if combined is None or model is None:
        raise ValueError('No data provided...')

    gensim_dict = Dictionary()
    # model.wv.key_to_index is the gensim >= 4.0 name for the 3.x model.wv.vocab
    gensim_dict.doc2bow(list(model.wv.key_to_index.keys()),
                        allow_update=True)
    # Index 0 is reserved for words with frequency below n_exposures, hence k + 1
    w2indx = {v: k + 1 for k, v in gensim_dict.items()}  # (k -> v) => (v -> k + 1)
    with open("word2index.txt", 'w', encoding='utf8') as f:
        for key in w2indx:
            f.write('{} {}\n'.format(key, w2indx[key]))
    w2vec = {word: model.wv[word] for word in w2indx.keys()}  # word -> word vector

    def parse_dataset(combined):  # local helper: words -> indices
        data = []
        for sentence in combined:
            # Words missing from the vocabulary (frequency < n_exposures) map to 0
            data.append([w2indx.get(word, 0) for word in sentence])
        return data

    combined = parse_dataset(combined)  # [[1, 2, 3, ...], [...]]
    # Pad or truncate every sentence to maxlen indices
    combined = sequence.pad_sequences(combined, maxlen=maxlen)
    return w2indx, w2vec, combined


# Train Word2Vec, then return each word's index, each word's vector,
# and the index sequence for every sentence
def word2vec_train(combined):
    # Parameter names follow gensim >= 4.0 (3.x used size= and iter=)
    model = Word2Vec(vector_size=vocab_dim,
                     min_count=n_exposures,
                     window=window_size,
                     workers=cpu_count,
                     epochs=n_iterations)
    model.build_vocab(combined)  # input: list of tokenized sentences
    model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)
    model.save('./model/Word2vec_model.pkl')
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined


def get_data(index_dict, word_vectors, combined, y):
    n_symbols = len(index_dict) + 1  # vocabulary size, plus 1 for index 0 (words below the frequency threshold)
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # row 0 stays all zeros
    for word, index in index_dict.items():  # fill rows 1..n with each word's vector
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2, random_state=5)
    y_train = keras.utils.to_categorical(y_train, num_classes=2)  # one-hot labels, shape [len(y), 2]
    y_test = keras.utils.to_categorical(y_test, num_classes=2)
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test


# Define the network structure
def train_bilstm(n_symbols, embedding_weights, x_train, y_train):
    print('Defining a Simple Keras Model...')
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,  # index 0 (padding and rare words) is masked
                        weights=[embedding_weights],  # initialize from the Word2Vec vectors
                        input_length=maxlen))
    # units is the Keras 2 name for the hidden size (output_dim in Keras 1)
    model.add(Bidirectional(LSTM(units=50, activation='tanh')))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))  # fully connected layer with 2 output classes

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch, verbose=2)

    model.save('./model/bilstm.h5')


if __name__ == '__main__':
    # Train the model and save it
    print('Loading dataset...')
    combined, y = loadfile()
    print(len(combined), len(y))
    print('Preprocessing data...')
    combined = tokenizer(combined)
    print('Training word2vec model...')
    index_dict, word_vectors, combined = word2vec_train(combined)

    print('Converting data to model input format...')
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(
        index_dict, word_vectors, combined, y)
    print('Feature and label shapes:')
    print(x_train.shape, y_train.shape)

    print('Training bilstm model...')
    train_bilstm(n_symbols, embedding_weights, x_train, y_train)

    print('Loading bilstm model...')
    model = load_model('./model/bilstm.h5')

    # Predict class probabilities, then take the most probable class per sample
    y_prob = model.predict(x_test)
    y_pred = np.argmax(y_prob, axis=1)
    y_true = np.argmax(y_test, axis=1)  # undo the one-hot encoding

    # Label 0 is negative and label 1 is positive (see loadfile)
    print(classification_report(y_true, y_pred, target_names=['negative', 'positive']))
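
The last step of the script turns each row of softmax probabilities into a class label and then scores the labels. As a rough illustration of what that involves, here is a plain-Python sketch that picks the most probable class per row and computes precision and recall for the positive class, two of the figures that classification_report prints. The probabilities and true labels are made up for the example.

```python
def argmax(row):
    """Index of the largest value in a row (first one on ties)."""
    return max(range(len(row)), key=lambda j: row[j])


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up softmax outputs for four reviews: [P(negative), P(positive)]
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 1]

y_pred = [argmax(row) for row in probabilities]
precision, recall = precision_recall(y_true, y_pred)
print(y_pred)     # [0, 1, 0, 1]
print(precision)  # 1.0
```

In this toy data the third review is positive but predicted negative, so precision is perfect while recall drops to two out of three.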