NLP-Task3: Attention-Based Text Matching

Given two sentences, determine the relationship between them, implemented with a bidirectional attention mechanism.

Dataset: https://nlp.stanford.edu/projects/snli/
Reference paper: Enhanced LSTM for Natural Language Inference
Paper walkthrough: Note

  • Key concepts
    • Attention mechanism
    • token2token attention

This post follows the blog post "NLP-Beginner 任务三:基于注意力机制的文本匹配+pytorch" (NLP-Beginner Task 3: attention-based text matching with PyTorch).

1. Task Introduction

This task uses the ESIM model proposed in the reference paper to perform text matching.

1.1 Dataset

The training set contains about 550,000 sentence pairs. There are four relation labels: entailment, contradiction, neutral, and unknown (-), where "-" means the annotators did not reach a consensus.

2. Feature Extraction: Word Embedding
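
Two embedding schemes are compared in the code of section 4.2: randomly initialized word vectors (the Random_embedding class, whose embedding matrix is Xavier-initialized inside the model) and 50-dimensional pre-trained GloVe vectors (the Glove_embedding class). In both cases one extra index, len_words, is reserved for the padding token used to bring every sentence to a fixed length.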

3. Neural Network

Based on the reference paper Enhanced LSTM for Natural Language Inference.

3.1 Input Encoding
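
Following the ESIM paper, the input encoding layer embeds each word of the premise a and the hypothesis b and runs a bidirectional LSTM over each sentence (implemented by the Input_Encoding module in section 4.2):

$$\bar{a}_i = \mathrm{BiLSTM}(a, i),\ i \in [1, \ell_a] \qquad \bar{b}_j = \mathrm{BiLSTM}(b, j),\ j \in [1, \ell_b]$$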

3.2 Local Inference Modeling
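
This is the token2token attention listed among the key concepts above. Every token of one sentence attends over all tokens of the other through the dot-product score $e_{ij} = \bar{a}_i^{\top}\bar{b}_j$:

$$\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b}\exp(e_{ik})}\,\bar{b}_j, \qquad \tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a}\exp(e_{kj})}\,\bar{a}_i$$

The local inference information is then enhanced by concatenating the encodings with their attended counterparts, their difference, and their element-wise product:

$$m_a = [\bar{a};\,\tilde{a};\,\bar{a}-\tilde{a};\,\bar{a}\odot\tilde{a}], \qquad m_b = [\bar{b};\,\tilde{b};\,\bar{b}-\tilde{b};\,\bar{b}\odot\tilde{b}]$$

The Local_Inference_Modeling module below implements these two steps with two softmax layers and batched matrix multiplications.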

3.3 Inference Composition
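
The enhanced vectors m_a and m_b are projected back to the embedding dimension and composed by a second BiLSTM:

$$v_a = \mathrm{BiLSTM}(F(m_a)), \qquad v_b = \mathrm{BiLSTM}(F(m_b))$$

In the paper F is a one-layer feed-forward network with ReLU, used to reduce the dimensionality and control model complexity; the Inference_Composition module below uses a plain linear projection for this step.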

3.4 Output
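
Average pooling and max pooling over time are applied to v_a and v_b, and the four pooled vectors are concatenated into the final representation

$$v = [v_{a,\mathrm{avg}};\,v_{a,\mathrm{max}};\,v_{b,\mathrm{avg}};\,v_{b,\mathrm{max}}]$$

which is fed to a multi-layer perceptron with a tanh hidden layer and a softmax output layer to predict the relation (the Prediction module below).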

3.5 Training the Network

The network is trained with the cross-entropy loss.
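
For a batch of N sentence pairs with gold labels y_n, the objective is the average negative log-likelihood

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{n=1}^{N} \log p_{\theta}(y_n \mid a_n, b_n)$$

which is what F.cross_entropy computes from the model's logits in the training loop of section 4.2.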

4. Code

4.1 Experimental Setup

  • Number of samples: about 550k
  • Train/test split: 7:3
  • Model: ESIM
  • Initialization: random initialization, or initialization from pre-trained GloVe vectors
  • Learning rate: 10^{-3}
  • Batch size: 1000

(The listing in 4.2 below runs a reduced demo: max_item = 100 samples, batch_size = 10, and 20 iterations.)

4.2 Code

# feature: data splitting, vocabulary/embedding construction, and batching
import random
import re
import torch
from torch.utils.data import Dataset, DataLoader

def data_split(data, test_rate=0.3):
    """ Take some data , and split them into training set and test set."""
    train = list()
    test = list()
    i = 0
    for datum in data:
        i += 1
        if random.random() > test_rate:
            train.append(datum)
        else:
            test.append(datum)
    return train, test


class Random_embedding():
    def __init__(self, data, test_rate=0.3):
        self.dict_words = dict()
        _data = [item.split(',') for item in data]
        self.data = [[item[5], item[6], item[0]] for item in _data]  # [sentence1, sentence2, gold label]
        self.len_words = 0
        self.train, self.test = data_split(self.data, test_rate=test_rate)
        self.type_dict = {'-': 0, 'contradiction': 1, 'entailment': 2, 'neutral': 3}
        self.train_y = [self.type_dict[term[2]] for term in self.train]  # Relation in training set
        self.test_y = [self.type_dict[term[2]] for term in self.test]  # Relation in test set
        self.train_s1_matrix = list()
        self.test_s1_matrix = list()
        self.train_s2_matrix = list()
        self.test_s2_matrix = list()
        self.longest = 50

    def get_words(self):
        pattern = '[A-Za-z|\']+'
        for term in self.data:
            for i in range(2):
                s = term[i]
                s = s.upper()
                words = re.findall(pattern, s)
                for word in words:  # Process every word
                    if word not in self.dict_words:
                        self.dict_words[word] = len(self.dict_words)
        self.len_words = len(self.dict_words)

    def get_id(self):
        """Convert every sentence into a fixed-length list of word ids.
        Index self.len_words is reserved for padding; sentences longer than
        self.longest are truncated so that every row has the same length."""
        pattern = '[A-Za-z|\']+'

        def to_ids(sentence):
            words = re.findall(pattern, sentence.upper())
            item = [self.dict_words[word] for word in words][:self.longest]
            item += [self.len_words] * (self.longest - len(item))  # pad with the reserved index
            return item

        for term in self.train:
            self.train_s1_matrix.append(to_ids(term[0]))
            self.train_s2_matrix.append(to_ids(term[1]))
        for term in self.test:
            self.test_s1_matrix.append(to_ids(term[0]))
            self.test_s2_matrix.append(to_ids(term[1]))
        self.len_words += 1  # count the padding index in the vocabulary size


class Glove_embedding():
    def __init__(self, data, trained_dict, test_rate=0.3):
        self.dict_words = dict()
        _data = [item.split(',') for item in data]
        self.data = [[item[5], item[6], item[0]] for item in _data]  # [sentence1, sentence2, gold label]
        self.trained_dict = trained_dict
        self.len_words = 0
        self.train, self.test = data_split(self.data, test_rate=test_rate)
        self.type_dict = {'-': 0, 'contradiction': 1, 'entailment': 2, 'neutral': 3}
        self.train_y = [self.type_dict[term[2]] for term in self.train]  # Relation in training set
        self.test_y = [self.type_dict[term[2]] for term in self.test]  # Relation in test set
        self.train_s1_matrix = list()
        self.test_s1_matrix = list()
        self.train_s2_matrix = list()
        self.test_s2_matrix = list()
        self.longest = 60
        self.embedding = list()

    def get_words(self):
        pattern = '[A-Za-z|\']+'
        for term in self.data:
            for i in range(2):
                s = term[i]
                s = s.upper()
                words = re.findall(pattern, s)
                for word in words:  # Process every word
                    if word not in self.dict_words:
                        self.dict_words[word] = len(self.dict_words)
                        if word in self.trained_dict:
                            self.embedding.append(self.trained_dict[word])
                        else:
                            # print(word)
                            # raise Exception("words not found!")
                            self.embedding.append([0] * 50)
        self.embedding.append([0]*50)
        self.len_words = len(self.dict_words)

    def get_id(self):
        """Convert every sentence into a fixed-length list of word ids.
        Index self.len_words is reserved for padding; sentences longer than
        self.longest are truncated so that every row has the same length."""
        pattern = '[A-Za-z|\']+'

        def to_ids(sentence):
            words = re.findall(pattern, sentence.upper())
            item = [self.dict_words[word] for word in words][:self.longest]
            item += [self.len_words] * (self.longest - len(item))  # pad with the reserved index
            return item

        for term in self.train:
            self.train_s1_matrix.append(to_ids(term[0]))
            self.train_s2_matrix.append(to_ids(term[1]))
        for term in self.test:
            self.test_s1_matrix.append(to_ids(term[0]))
            self.test_s2_matrix.append(to_ids(term[1]))
        self.len_words += 1  # count the padding index in the vocabulary size


class ClsDataset(Dataset):
    """ 文本分类数据集 """
    def __init__(self, sentence1,sentence2, relation):
        self.sentence1 = sentence1
        self.sentence2 = sentence2
        self.relation = relation

    def __getitem__(self, item):
        return self.sentence1[item], self.sentence2[item],self.relation[item]

    def __len__(self):
        return len(self.relation)


def collate_fn(batch_data):
    """Organize one batch: all sentences are already padded to a common length,
    so they can be stacked directly into LongTensors."""

    sents1, sents2, labels = zip(*batch_data)

    return torch.LongTensor(sents1), torch.LongTensor(sents2), torch.LongTensor(labels)


def get_batch(x1,x2,y,batch_size):
    dataset = ClsDataset(x1,x2, y)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True,drop_last=True,collate_fn=collate_fn)
    return dataloader

# data_split2: an alternative split based on an index permutation (kept for reference)
# def data_split(data, test_rate=0.3):
#     np.random.seed(2021)
#     shuffled_indices = np.random.permutation(len(data))
#     test_size = int(len(data) * test_rate)
#     test_indices = shuffled_indices[:test_size]
#     train_indices = shuffled_indices[test_size:]
#     train = list()
#     test = list()
#     for i in test_indices:
#         test.append(data[i])
#     for j in train_indices:
#         train.append(data[j])
#     return train, test

# Neural_Network_batch: the ESIM model and its sub-modules
import torch
import torch.nn as nn


class Input_Encoding(nn.Module):
    def __init__(self, len_feature, len_hidden, len_words,longest, weight=None, layer=1, batch_first=True, drop_out=0.5):
        super(Input_Encoding, self).__init__()
        self.len_feature = len_feature
        self.len_hidden = len_hidden
        self.len_words = len_words
        self.layer = layer
        self.longest=longest
        self.dropout = nn.Dropout(drop_out)
        if weight is None:
            x = nn.init.xavier_normal_(torch.Tensor(len_words, len_feature))
            self.embedding = nn.Embedding(num_embeddings=len_words, embedding_dim=len_feature, _weight=x)
        else:
            self.embedding = nn.Embedding(num_embeddings=len_words, embedding_dim=len_feature, _weight=weight)
        self.lstm = nn.LSTM(input_size=len_feature, hidden_size=len_hidden, num_layers=layer, batch_first=batch_first,
                            bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len) word ids -> (batch, seq_len, 2 * len_hidden) BiLSTM states
        x = torch.LongTensor(x)
        x = self.embedding(x)
        x = self.dropout(x)
        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        return x


class Local_Inference_Modeling(nn.Module):
    def __init__(self):
        super(Local_Inference_Modeling, self).__init__()
        self.softmax_1 = nn.Softmax(dim=1)
        self.softmax_2 = nn.Softmax(dim=2)

    def forward(self, a_bar, b_bar):
        # e[i][j]: dot-product similarity between token i of sentence a and token j of sentence b
        e = torch.matmul(a_bar, b_bar.transpose(1, 2))

        # a_tilde: for each token of a, the attention-weighted sum over b (softmax over dim 2)
        a_tilde = self.softmax_2(e)
        a_tilde = a_tilde.bmm(b_bar)
        # b_tilde: for each token of b, the attention-weighted sum over a (softmax over dim 1)
        b_tilde = self.softmax_1(e)
        b_tilde = b_tilde.transpose(1, 2).bmm(a_bar)

        # Enhancement: concatenate [x, x~, x - x~, x * x~] as in the ESIM paper
        m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
        m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)

        return m_a, m_b
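
# Shape sketch (illustrative sizes, not from the original post): with a_bar of shape
# (batch=2, len_a=7, 2*hidden=100) and b_bar of shape (2, len_b=9, 100), the similarity
# matrix e has shape (2, 7, 9) and the returned m_a / m_b have shapes (2, 7, 400) / (2, 9, 400),
# i.e. four times the BiLSTM output size, which is why Inference_Composition expects 8 * len_hidden.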


class Inference_Composition(nn.Module):
    def __init__(self, len_feature, len_hidden_m, len_hidden, layer=1, batch_first=True, drop_out=0.5):
        super(Inference_Composition, self).__init__()
        self.linear = nn.Linear(len_hidden_m, len_feature)
        self.lstm = nn.LSTM(input_size=len_feature, hidden_size=len_hidden, num_layers=layer, batch_first=batch_first,
                            bidirectional=True)
        self.dropout = nn.Dropout(drop_out)

    def forward(self, x):
        x = self.linear(x)
        x = self.dropout(x)
        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)

        return x


class Prediction(nn.Module):
    def __init__(self, len_v, len_mid, type_num=4, drop_out=0.5):
        super(Prediction, self).__init__()
        self.mlp = nn.Sequential(nn.Dropout(drop_out), nn.Linear(len_v, len_mid), nn.Tanh(),
                                 nn.Linear(len_mid, type_num))

    def forward(self, a,b):

        v_a_avg=a.sum(1)/a.shape[1]
        v_a_max = a.max(1)[0]

        v_b_avg = b.sum(1) / b.shape[1]
        v_b_max = b.max(1)[0]

        out_put = torch.cat((v_a_avg, v_a_max,v_b_avg,v_b_max), dim=-1)

        return self.mlp(out_put)


class ESIM(nn.Module):
    def __init__(self, len_feature, len_hidden, len_words,longest, type_num=4, weight=None, layer=1, batch_first=True,
                 drop_out=0.5):
        super(ESIM, self).__init__()
        self.len_words=len_words
        self.longest=longest
        self.input_encoding = Input_Encoding(len_feature, len_hidden, len_words,longest, weight=weight, layer=layer,
                                             batch_first=batch_first, drop_out=drop_out)
        self.local_inference_modeling = Local_Inference_Modeling()
        self.inference_composition = Inference_Composition(len_feature, 8 * len_hidden, len_hidden, layer=layer,
                                                           batch_first=batch_first, drop_out=drop_out)
        self.prediction=Prediction(len_hidden*8,len_hidden,type_num=type_num,drop_out=drop_out)

    def forward(self,a,b):
        a_bar=self.input_encoding(a)
        b_bar=self.input_encoding(b)

        m_a,m_b=self.local_inference_modeling(a_bar,b_bar)

        v_a=self.inference_composition(m_a)
        v_b=self.inference_composition(m_b)

        out_put=self.prediction(v_a,v_b)

        return out_put
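
# Illustrative smoke test (hypothetical sizes, not part of the original script).
# Uncomment to check that two batches of padded word ids map to 4-way logits:
# model = ESIM(len_feature=50, len_hidden=50, len_words=100, longest=20)
# a = torch.randint(0, 100, (8, 20))   # 8 premises, each padded to 20 token ids
# b = torch.randint(0, 100, (8, 20))   # 8 hypotheses
# print(model(a, b).shape)             # expected: torch.Size([8, 4])
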
# train, evaluate and plot
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from torch import optim
import random
import numpy


def NN_embdding(model, train, test, learning_rate, iter_times):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_fun = F.cross_entropy
    train_loss_record = list()
    test_loss_record = list()
    train_record = list()
    test_record = list()
    # torch.autograd.set_detect_anomaly(True)

    for iteration in range(iter_times):
        model.train()
        for i, batch in enumerate(train):
            x1, x2, y = batch
            pred = model(x1, x2)
            optimizer.zero_grad()
            loss = loss_fun(pred, y)
            loss.backward()
            optimizer.step()

        model.eval()
        train_acc = list()
        test_acc = list()
        train_loss = 0
        test_loss = 0
        with torch.no_grad():  # evaluation only: no gradients are needed
            for i, batch in enumerate(train):
                x1, x2, y = batch
                pred = model(x1, x2)
                loss = loss_fun(pred, y)
                train_loss += loss.item()  # .item() detaches the loss so it can be summed and plotted
                _, y_pre = torch.max(pred, -1)
                acc = torch.mean((torch.as_tensor(y_pre == y, dtype=torch.float)))
                train_acc.append(acc.item())

            for i, batch in enumerate(test):
                x1, x2, y = batch
                pred = model(x1, x2)
                loss = loss_fun(pred, y)
                test_loss += loss.item()
                _, y_pre = torch.max(pred, -1)
                acc = torch.mean((torch.as_tensor(y_pre == y, dtype=torch.float)))
                test_acc.append(acc.item())

        trains_acc = sum(train_acc) / len(train_acc)
        tests_acc = sum(test_acc) / len(test_acc)

        train_loss_record.append(train_loss / len(train_acc))
        test_loss_record.append(test_loss / len(test_acc))
        train_record.append(trains_acc)
        test_record.append(tests_acc)
        print("---------- Iteration", iteration + 1, "----------")
        print("Train loss:", train_loss)
        print("Test loss:", test_loss)
        print("Train accuracy:", trains_acc)
        print("Test accuracy:", tests_acc)

    return train_loss_record, test_loss_record, train_record, test_record


def NN_plot(random_embedding, glove_embedding, len_feature, len_hidden, learning_rate, batch_size, iter_times):
    train_random = get_batch(random_embedding.train_s1_matrix, random_embedding.train_s2_matrix,
                             random_embedding.train_y,batch_size)
    test_random = get_batch(random_embedding.test_s1_matrix, random_embedding.test_s2_matrix,
                            random_embedding.test_y,batch_size)
    train_glove = get_batch(glove_embedding.train_s1_matrix, glove_embedding.train_s2_matrix,
                            glove_embedding.train_y,batch_size)
    test_glove = get_batch(glove_embedding.test_s1_matrix, glove_embedding.test_s2_matrix,
                           glove_embedding.test_y,batch_size)
    random.seed(2021)
    numpy.random.seed(2021)
    torch.cuda.manual_seed(2021)
    torch.manual_seed(2021)
    random_model = ESIM(len_feature, len_hidden, random_embedding.len_words, longest=random_embedding.longest)
    random.seed(2021)
    numpy.random.seed(2021)
    torch.cuda.manual_seed(2021)
    torch.manual_seed(2021)
    glove_model = ESIM(len_feature, len_hidden, glove_embedding.len_words, longest=glove_embedding.longest,
                       weight=torch.as_tensor(glove_embedding.embedding, dtype=torch.float))
    random.seed(2021)
    numpy.random.seed(2021)
    torch.cuda.manual_seed(2021)
    torch.manual_seed(2021)
    trl_ran, tsl_ran, tra_ran, tea_ran = NN_embdding(random_model, train_random, test_random, learning_rate,
                                                     iter_times)
    random.seed(2021)
    numpy.random.seed(2021)
    torch.cuda.manual_seed(2021)
    torch.manual_seed(2021)
    trl_glo, tsl_glo, tra_glo, tea_glo = NN_embdding(glove_model, train_glove, test_glove, learning_rate,
                                                     iter_times)
    x = list(range(1, iter_times + 1))
    plt.subplot(2, 2, 1)
    plt.plot(x, trl_ran, 'r--', label='random')
    plt.plot(x, trl_glo, 'g--', label='glove')
    plt.legend(fontsize=10)
    plt.title("Train Loss")
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.subplot(2, 2, 2)
    plt.plot(x, tsl_ran, 'r--', label='random')
    plt.plot(x, tsl_glo, 'g--', label='glove')
    plt.legend(fontsize=10)
    plt.title("Test Loss")
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.subplot(2, 2, 3)
    plt.plot(x, tra_ran, 'r--', label='random')
    plt.plot(x, tra_glo, 'g--', label='glove')
    plt.legend(fontsize=10)
    plt.title("Train Accuracy")
    plt.xlabel("Iterations")
    plt.ylabel("Accuracy")
    plt.ylim(0, 1)
    plt.subplot(2, 2, 4)
    plt.plot(x, tea_ran, 'r--', label='random')
    plt.plot(x, tea_glo, 'g--', label='glove')
    plt.legend(fontsize=10)
    plt.title("Test Accuracy")
    plt.xlabel("Iterations")
    plt.ylabel("Accuracy")
    plt.ylim(0, 1)
    plt.tight_layout()
    fig = plt.gcf()
    fig.set_size_inches(8, 8, forward=True)
    plt.savefig('main_plot.jpg')
    plt.show()
# main: load the data, build the two embeddings, then train and plot
with open('../input/stanford-natural-language-inference-corpus/snli_1.0_train.csv', 'r') as f:
    temp = f.readlines()

with open('../input/glove6b50dtxt/glove.6B.50d.txt', 'rb') as f:  # pre-trained GloVe 6B 50-d vectors
    lines = f.readlines()

# Construct dictionary with glove

trained_dict = dict()
n = len(lines)
for i in range(n):
    line = lines[i].split()
    trained_dict[line[0].decode("utf-8").upper()] = [float(line[j]) for j in range(1, 51)]

data = temp[1:]  # drop the CSV header line
max_item = 100  # small subset for a quick demo; use the full set to reproduce the settings in 4.1
data = data[:max_item]
learning_rate = 0.001
len_feature = 50  # GloVe 6B vectors are 50-dimensional
len_hidden = 50
iter_times = 20
batch_size = 10

# random embedding
random.seed(2021)
random_embedding = Random_embedding(data=data)
random_embedding.get_words()
random_embedding.get_id()

# trained embedding : glove
random.seed(2021)
glove_embedding = Glove_embedding(data=data, trained_dict=trained_dict)
glove_embedding.get_words()
glove_embedding.get_id()

NN_plot(random_embedding, glove_embedding, len_feature, len_hidden, learning_rate, batch_size, iter_times)

4.3 Results

To be continued.
