SemEval2019Task3_ERC | (5) Bidirectional LSTM Network for Contextual ERC

Paper download (extraction code: bjab)

Open-source code

Contents

1. Task Introduction

2. Model Description

3. Experiments

4. Code


1. Task Introduction

SemEval2019Task3_ERC is Task 3 of Semantic Evaluation (SemEval) 2019: Emotion Recognition in Conversation.

The dataset used is EmoContext, a text-only conversation dataset collected from a social platform, split into training, validation, and test sets. The training, validation, and test sets contain 30,160, 2,755, and 5,509 conversations respectively. Every conversation has three turns between two speakers (Person1, Person2, Person1), so the three splits contain 90,480, 8,265, and 16,527 utterances respectively.

The dataset suffers from severe class imbalance, and in an unusual way: the training set is fairly balanced, but in the validation and test sets each emotion class accounts for only about 4% or less of the data (which matches real conversations, where most utterances carry no emotion at all).

Unlike the usual task of classifying the emotion of a single given text/sentence, the goal here is: given a (three-turn) conversation, predict the emotion expressed by the speaker of the last turn/utterance. The model therefore has to exploit the conversational context when classifying the final utterance.

In each conversation only the last utterance carries an emotion label. There are three emotion classes, Happiness, Sadness, and Anger, plus an additional label (others). 5,740 conversations are labeled Happiness, 6,816 Sadness, and 6,533 Anger; all remaining conversations are labeled others. The official evaluation metric is the micro F1-score, computed only over the three emotion classes (Happiness, Sadness, Anger) and excluding others. The best result of the model described in this post is 0.7259.

Although this dataset contains only three-turn conversations with a label on the last turn alone, a model trained on it can be applied to a broader setting: classifying the emotion of any utterance in a conversation. Suppose a conversation has N utterances/turns. To classify the i-th utterance, simply feed the i-th utterance together with utterances i-1 and i-2 to the trained model; doing this for i = 1, ..., N determines the emotion of every utterance in the conversation (for the first two utterances, the missing context can be padded). A minimal sketch of this sliding-window scheme follows.
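
Below is a minimal sketch of that sliding-window scheme. The names PAD_UTTERANCE and predict_emotion are hypothetical stand-ins for the padding text and for a wrapper around the trained model and preprocessing described in Sections 2 and 4:

PAD_UTTERANCE = ""  # hypothetical placeholder used when an utterance has no preceding context

def classify_all_utterances(conversation, predict_emotion):
    # conversation: list of N utterance strings;
    # predict_emotion: hypothetical function taking (turn1, turn2, turn3) strings
    # and returning the emotion label predicted by the trained 3-turn model.
    labels = []
    for i, utterance in enumerate(conversation):
        turn1 = conversation[i - 2] if i >= 2 else PAD_UTTERANCE
        turn2 = conversation[i - 1] if i >= 1 else PAD_UTTERANCE
        labels.append(predict_emotion(turn1, turn2, utterance))
    return labels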

 

2. Model Description

  • Input

The three utterances of a conversation are processed and fed to the model separately. Before they are fed in, the Ekphrasis toolkit is used for text preprocessing, including spell checking, lowercasing, and tokenization (a vocabulary is built, the words of each utterance are mapped to indices, and every utterance is padded to a uniform length). Each word/index then passes through an Embedding layer and is mapped to a word vector. The embedding matrix of the Embedding layer is initialized with the datastories embeddings (pretrained on 330M tweets, 300-dimensional, frozen). Gaussian noise is added to the embedding output.

  • turn-level encoder

The first post in this series also used a turn-level encoder, but there each turn was encoded by its own BiLSTM with no weight sharing. In the model used here, the BiLSTM encoders for turn1 and turn3 share weights, while turn2 uses a separate BiLSTM encoder. Since turn1 and turn3 come from the same speaker, sharing weights helps capture speaker-specific characteristics.

  • dialogue-level encoder

Each turn is encoded by a BiLSTM and the hidden state of the last time step is kept; the three hidden states are concatenated to form the representation of the whole dialogue.

  • Output layer

The concatenated representation first passes through Dropout, then through two Dense layers, the last of which uses softmax for the multi-class prediction; the cross-entropy loss is computed on this output.

 

3. Experiments

Variants of the model:

1) LSTM1: uses unidirectional LSTM encoders

2) LSTM2: the model adopted in this paper

3) LSTM3: the BiLSTM encoders for the turns of the same speaker (turn1, turn3) do not share weights, i.e., three separate BiLSTM encoders are used.

4) LSTMw: LSTM2 with weight regularization added.

5) LSTMa: an attention layer is added after the BiLSTM encoders (a rough sketch follows below).
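
The paper's exact attention formulation is not reproduced in this post. As a rough, non-authoritative illustration of what the LSTMa variant could look like, here is a sketch of an attention-pooled BiLSTM turn encoder in Keras (the function and layer names are illustrative, not from the original code, and a Keras 2 version providing keras.layers.Softmax is assumed):

from keras import backend as K
from keras.layers import LSTM, Bidirectional, Dense, Lambda, Softmax

def attentive_turn_encoder(lstm_dim, dropout_lstm):
    # Sketch of the LSTMa idea: a BiLSTM returning all hidden states,
    # followed by additive attention that pools them into a single vector.
    bilstm = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm, return_sequences=True))
    score_layer = Dense(1, activation='tanh')  # one unnormalized score per time step

    def encode(x):  # x: (batch, seq_len, embed_dim)
        h = bilstm(x)                      # (batch, seq_len, 2*lstm_dim)
        scores = score_layer(h)            # (batch, seq_len, 1)
        weights = Softmax(axis=1)(scores)  # attention weights over the time axis
        # weighted sum of the hidden states over time -> (batch, 2*lstm_dim)
        return Lambda(lambda t: K.sum(t[0] * t[1], axis=1))([h, weights])

    return encode

In the buildModel function shown in Section 4, lstm1 and lstm2 would then be replaced by two such encoders, still sharing one encoder between turn1 and turn3.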

4. Code

  • Loading the data
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import numpy as np

import re
import io

label2emotion = {0: "others", 1: "happy", 2: "sad", 3: "angry"}
emotion2label = {"others": 0, "happy": 1, "sad": 2, "angry": 3}

emoticons_additional = {
    '(^・^)': '<happy>', ':‑c': '<sad>', '=‑d': '<happy>', ":'‑)": '<happy>', ':‑d': '<laugh>',
    ':‑(': '<sad>', ';‑)': '<happy>', ':‑)': '<happy>', ':\\/': '<sad>', 'd=<': '<annoyed>',
    ':‑/': '<annoyed>', ';‑]': '<happy>', '(^�^)': '<happy>', 'angru': 'angry', "d‑':":
        '<annoyed>', ":'‑(": '<sad>', ":‑[": '<annoyed>', '(�?�)': '<happy>', 'x‑d': '<laugh>',
}
# define the Ekphrasis text preprocessing pipeline

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'url', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter",
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter",
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=True,  # spell correction for elongated words
    # select a tokenizer. You can use SocialTokenizer, or pass your own
    # the tokenizer, should take as input a string and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    # list of dictionaries, for replacing tokens extracted from the text,
    # with other expressions. You can pass more than one dictionaries.
    dicts=[emoticons, emoticons_additional]
)


def tokenize(text):
    text = " ".join(text_processor.pre_process_doc(text))
    return text
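
A quick usage check (illustrative only; the exact token stream depends on the Ekphrasis version and the dictionaries configured above):

# illustrative call; the result is a single whitespace-joined string of normalized, lowercased tokens
print(tokenize("I caaan't believe it!!! :)"))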


def preprocessData(dataFilePath, mode):
    conversations = []
    labels = []
    with io.open(dataFilePath, encoding="utf8") as finput:
        finput.readline()  # skip the header line
        for line in finput:  # iterate over the file line by line
            line = line.strip().split('\t')  # columns: id turn1 turn2 turn3 label
            for i in range(1, 4):  # preprocess turn1, turn2 and turn3
                line[i] = tokenize(line[i])
            if mode == "train":  # collect the label
                labels.append(emotion2label[line[4]])
            conv = line[1:4]  # one conversation = 3 turns
            conversations.append(conv)
    if mode == "train":  # "train" mode: the file contains labels
        return np.array(conversations), np.array(labels)
    else:  # no labels
        return np.array(conversations)

# Process the training, dev and test sets.
# mode="train" is used for all three splits because the dev/test files used here also contain gold labels.
texts_train, labels_train = preprocessData('./train.txt', mode="train")
texts_dev, labels_dev = preprocessData('./dev.txt', mode="train")
texts_test, labels_test = preprocessData('./test.txt', mode="train")

 

  • Loading word embeddings
def getEmbeddings(file):  # read the pretrained word vectors from disk
    embeddingsIndex = {}  # mapping from word to embedding vector
    dim = 0
    with io.open(file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            embeddingVector = np.asarray(values[1:], dtype='float32')
            embeddingsIndex[word] = embeddingVector
            dim = len(embeddingVector)
    return embeddingsIndex, dim


def getEmbeddingMatrix(wordIndex, embeddings, dim):
    # row 0 is reserved for padding; words without a pretrained vector keep an all-zero row
    embeddingMatrix = np.zeros((len(wordIndex) + 1, dim))
    for word, i in wordIndex.items():
        vector = embeddings.get(word)
        if vector is not None:
            embeddingMatrix[i] = vector
    return embeddingMatrix
from keras.preprocessing.text import Tokenizer

# read the pretrained word vectors
embeddings, dim = getEmbeddings('emosense.300d.txt')
# build the vocabulary directly from the words of the pretrained embeddings
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts([' '.join(list(embeddings.keys()))])

wordIndex = tokenizer.word_index  # mapping from word to index
print("Found %s unique tokens." % len(wordIndex))

embeddings_matrix = getEmbeddingMatrix(wordIndex, embeddings, dim)  # embedding matrix initialized from the pretrained vectors
  • Data preprocessing
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

MAX_SEQUENCE_LENGTH = 24  # every utterance is padded/truncated to this length
# randomly split 20% of the original training set off as a validation set
X_train, X_val, y_train, y_val = train_test_split(texts_train, labels_train, test_size=0.2, random_state=42)

# convert the labels to one-hot form
labels_categorical_train = to_categorical(np.asarray(y_train))
labels_categorical_val = to_categorical(np.asarray(y_val))
labels_categorical_dev = to_categorical(np.asarray(labels_dev))
labels_categorical_test = to_categorical(np.asarray(labels_test))


def get_sequances(texts, sequence_length):
    message_first = pad_sequences(tokenizer.texts_to_sequences(texts[:, 0]), sequence_length)   # turn1
    message_second = pad_sequences(tokenizer.texts_to_sequences(texts[:, 1]), sequence_length)  # turn2
    message_third = pad_sequences(tokenizer.texts_to_sequences(texts[:, 2]), sequence_length)   # turn3
    return message_first, message_second, message_third

# For every split, convert the words of each turn to vocabulary indices and pad to a uniform length.
message_first_message_train, message_second_message_train, message_third_message_train = get_sequances(X_train, MAX_SEQUENCE_LENGTH)
message_first_message_val, message_second_message_val, message_third_message_val = get_sequances(X_val, MAX_SEQUENCE_LENGTH)
message_first_message_dev, message_second_message_dev, message_third_message_dev = get_sequances(texts_dev, MAX_SEQUENCE_LENGTH)
message_first_message_test, message_second_message_test, message_third_message_test = get_sequances(texts_test, MAX_SEQUENCE_LENGTH)
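
As a quick sanity check, each of these arrays has shape (num_conversations, MAX_SEQUENCE_LENGTH); assuming the splits above, the row counts follow from the statistics in Section 1 and the 80/20 split:

# e.g. (24128, 24) for the training portion, (2755, 24) for dev and (5509, 24) for test
print(message_first_message_train.shape, message_first_message_dev.shape, message_first_message_test.shape)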
  • Model
from keras.layers import Input, Dense, Embedding, Concatenate, Activation, \
    Dropout, LSTM, Bidirectional, GlobalMaxPooling1D, GaussianNoise
from keras.models import Model


def buildModel(embeddings_matrix, sequence_length, lstm_dim, hidden_layer_dim, num_classes, 
               noise=0.1, dropout_lstm=0.2, dropout=0.2):
    
    # (batch, seq_len)
    turn1_input = Input(shape=(sequence_length,), dtype='int32')
    turn2_input = Input(shape=(sequence_length,), dtype='int32')
    turn3_input = Input(shape=(sequence_length,), dtype='int32')

    embedding_dim = embeddings_matrix.shape[1]  # word embedding dimension
    # Embedding layer, initialized with the pretrained word vectors
    embeddingLayer = Embedding(embeddings_matrix.shape[0],
                               embedding_dim,
                               weights=[embeddings_matrix],
                               input_length=sequence_length,
                               trainable=False)  # frozen

    # (batch, seq_len, embed_size)
    turn1_branch = embeddingLayer(turn1_input)
    turn2_branch = embeddingLayer(turn2_input)
    turn3_branch = embeddingLayer(turn3_input)

    # add Gaussian noise to the embeddings: (batch, seq_len, embed_size)
    turn1_branch = GaussianNoise(noise)(turn1_branch)
    turn2_branch = GaussianNoise(noise)(turn2_branch)
    turn3_branch = GaussianNoise(noise)(turn3_branch)

    # define two bidirectional LSTM encoders
    lstm1 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))
    lstm2 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))

    # turn1 and turn3 come from the same speaker, so they share lstm1;
    # each encoder returns the last hidden state with the forward and backward
    # directions concatenated: (batch, 2*lstm_dim = 2*64 = 128)
    turn1_branch = lstm1(turn1_branch)
    turn2_branch = lstm2(turn2_branch)
    turn3_branch = lstm1(turn3_branch)

    # concatenate the three hidden states: (batch, 128*3 = 384)
    x = Concatenate(axis=-1)([turn1_branch, turn2_branch, turn3_branch])

    # dropout
    x = Dropout(dropout)(x)

    # hidden dense layer: (batch, hidden_layer_dim = 30)
    x = Dense(hidden_layer_dim, activation='relu')(x)

    # output layer with softmax for the multi-class prediction: (batch, num_classes = 4)
    output = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=[turn1_input, turn2_input, turn3_input], outputs=output)

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

    return model

model = buildModel(embeddings_matrix, MAX_SEQUENCE_LENGTH, lstm_dim=64, hidden_layer_dim=30, num_classes=4)
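
At this point model.summary() is a convenient way to confirm the shapes described in Section 2 (three 24-token inputs, frozen 300-dimensional embeddings, 128-dimensional turn encodings, and a 384-dimensional dialogue representation):

model.summary()  # prints each layer with its output shape and parameter count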
  • Metrics
from kutilities.callbacks import MetricsCallback, PlottingCallback
from sklearn.metrics import f1_score, precision_score, recall_score
from keras.callbacks import ModelCheckpoint, TensorBoard

# metrics to compute during training
metrics = {
    "f1_e": (lambda y_test, y_pred:  # micro F1 over the three emotion classes only (others excluded)
             f1_score(y_test, y_pred, average='micro',
                      labels=[emotion2label['happy'],
                              emotion2label['sad'],
                              emotion2label['angry']
                              ])),
    "precision_e": (lambda y_test, y_pred:  # micro precision over the three emotion classes only
                    precision_score(y_test, y_pred, average='micro',
                                    labels=[emotion2label['happy'],
                                            emotion2label['sad'],
                                            emotion2label['angry']
                                            ])),
    "recall_e": (lambda y_test, y_pred:  # micro recall over the three emotion classes only
                 recall_score(y_test, y_pred, average='micro',
                              labels=[emotion2label['happy'],
                                      emotion2label['sad'],
                                      emotion2label['angry']
                                      ]))
}
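
To make the restriction to the three emotion classes concrete, here is a small toy example (made-up labels, not real data) of how the labels argument of the f1_score imported above excludes the others class:

# toy gold and predicted class indices (0 = others, 1 = happy, 2 = sad, 3 = angry)
toy_gold = [0, 1, 2, 3, 0, 2]
toy_pred = [0, 1, 2, 0, 3, 2]
# micro F1 over happy/sad/angry only; the correct "others" prediction at position 0 is ignored
print(f1_score(toy_gold, toy_pred, average='micro', labels=[1, 2, 3]))  # -> 0.75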

_datasets = {}
# dev set
_datasets["dev"] = [[message_first_message_dev, message_second_message_dev, message_third_message_dev],
                    np.array(labels_categorical_dev)]
# validation set (held out from the training set)
_datasets["val"] = [[message_first_message_val, message_second_message_val, message_third_message_val],
                    np.array(labels_categorical_val)]

# compute the metrics defined above on the dev and validation sets during training
metrics_callback = MetricsCallback(datasets=_datasets, metrics=metrics)

filepath = "models/bidirectional_LSTM_best_weights_{epoch:02d}-{val_acc:.4f}.hdf5"
# save the model with the highest validation accuracy
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', save_best_only=True, save_weights_only=False,
                             mode='auto', period=1)
# TensorBoard logging / visualization
tensorboardCallback = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
  • Training
# train the model
history = model.fit([message_first_message_train, message_second_message_train, message_third_message_train],
                    np.array(labels_categorical_train),
                    callbacks=[metrics_callback, checkpoint, tensorboardCallback],
                    validation_data=(
                        [message_first_message_val, message_second_message_val, message_third_message_val],
                        np.array(labels_categorical_val)
                    ),
                    epochs=20,
                    batch_size=200)
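
The returned history object records the per-epoch loss and accuracy. A minimal matplotlib sketch for inspecting the curves (assuming matplotlib is available; the 'acc'/'val_acc' key names correspond to the metrics=['acc'] setting of this Keras version, newer versions use 'accuracy'):

import matplotlib.pyplot as plt

# plot training vs. validation accuracy per epoch
plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()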
  • Evaluation
model.load_weights("models/bidirectional_LSTM_best_weights_0010-0.9125.hdf5")  # load the saved best weights
y_pred = model.predict([message_first_message_dev, message_second_message_dev, message_third_message_dev])  # predict on the dev set

from sklearn.metrics import classification_report

# compare dev predictions with the gold labels using the metrics defined above
# (computed over the three emotion classes only, excluding others)
for title, metric in metrics.items():
    print(title, metric(labels_categorical_dev.argmax(axis=1), y_pred.argmax(axis=1)))
print(classification_report(labels_categorical_dev.argmax(axis=1), y_pred.argmax(axis=1)))  # full per-class report

y_pred = model.predict([message_first_message_test, message_second_message_test, message_third_message_test])  # predict on the test set

# same evaluation on the test set (emotion classes only, excluding others)
for title, metric in metrics.items():
    print(title, metric(labels_categorical_test.argmax(axis=1), y_pred.argmax(axis=1)))
print(classification_report(labels_categorical_test.argmax(axis=1), y_pred.argmax(axis=1)))  # full per-class report
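
If human-readable predictions are needed, for example for error analysis or for assembling a submission file, the predicted class indices can be mapped back to emotion names with the label2emotion dictionary defined at the top of the script:

# map predicted class indices back to emotion names
predicted_labels = [label2emotion[idx] for idx in y_pred.argmax(axis=1)]
print(predicted_labels[:10])  # the first 10 predicted emotion names on the test set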

 

 
