NLP-04 TextCNN Reading Notes

Title: Convolutional Neural Networks for Sentence Classification

Venue: EMNLP 2014

Affiliation: New York University

 

1. Network Architecture

Overall, the network consists of a convolutional layer, a max-pooling layer, and a fully connected layer, preceded by an embedding layer at the input.

 

import torch
import torch.nn as nn
import torch.nn.functional as F
from BasicModule import BasicModule  # thin nn.Module subclass with save/load helpers (see section 4)


class TextCNN(BasicModule):

    def __init__(self, config):
        super(TextCNN, self).__init__()
        # embedding layer
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed_size)
        # convolutional layers (in_channels=embed_size, out_channels=filter_num, kernel_size=filter width)
        self.conv1d_1 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[0])
        self.conv1d_2 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[1])
        self.conv1d_3 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[2])
        # max-over-time pooling: kernel size equals each conv's output length (assumes filters = [3, 4, 5])
        self.Max_pool_1 = nn.MaxPool1d(config.sentence_max_size-3+1)
        self.Max_pool_2 = nn.MaxPool1d(config.sentence_max_size-4+1)
        self.Max_pool_3 = nn.MaxPool1d(config.sentence_max_size-5+1)
        # dropout layer
        self.dropout = nn.Dropout(config.dropout)
        # classification layer
        self.fc = nn.Linear(config.filter_num*len(config.filters), config.label_num)

    def forward(self, x):
        x = x.long()
        out = self.embedding(x)  # batch_size * length * embedding_size
        out = out.transpose(1, 2).contiguous()  # batch_size * embedding_size * length
        x1 = F.relu(self.conv1d_1(out))
        x2 = F.relu(self.conv1d_2(out))
        x3 = F.relu(self.conv1d_3(out))
        x1 = self.Max_pool_1(x1).squeeze(2)  # batch_size * filter_num
        x2 = self.Max_pool_2(x2).squeeze(2)
        x3 = self.Max_pool_3(x3).squeeze(2)
        # print(x1.size(), x2.size(), x3.size())  # debug
        out = torch.cat([x1, x2, x3], 1)  # batch_size * (filter_num * 3)
        out = self.dropout(out)
        out = self.fc(out)
        return out
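
A quick shape sanity check is to push a random batch of padded token ids through the model. The config values below are illustrative assumptions rather than settings from the notes, and BasicModule is assumed to behave like a plain nn.Module (see the sketch in section 4).

# Minimal usage sketch -- the Config values are assumptions for illustration only.
class Config:
    embedding_pretrained = None
    n_vocab = 5000
    embed_size = 300
    filter_num = 100
    filters = [3, 4, 5]
    sentence_max_size = 50
    dropout = 0.5
    label_num = 2

model = TextCNN(Config())
x = torch.randint(0, Config.n_vocab, (8, Config.sentence_max_size))  # a batch of 8 padded sentences
print(model(x).shape)  # expected: torch.Size([8, 2])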

1.1 Regularization
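
According to the paper, regularization combines dropout (p = 0.5) on the penultimate layer with an L2 max-norm constraint on the weight vectors: after each gradient step, any weight vector whose L2 norm exceeds s = 3 is rescaled back to norm s. Below is a minimal sketch of that rescaling step; the helper name and its call site are my assumptions, not from the notes.

import torch

def apply_max_norm_(weight, s=3.0):
    # rescale, in place, each row of the weight matrix whose L2 norm exceeds s
    with torch.no_grad():
        norms = weight.norm(p=2, dim=1, keepdim=True)
        weight.mul_(torch.clamp(norms, max=s) / (norms + 1e-7))

# typical call site, right after optimizer.step():
# apply_max_norm_(model.fc.weight)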

1.2 Applications of TextCNN

2. Experimental Results and Analysis

2.1 Effect of Word Vectors

  •  The experiments compared here differ only in the word vectors used (GloVe vs. word2vec); both use the static, single-channel model.
  •  word2vec and GloVe achieve comparable performance, with no clear winner.
  •  Using both sets of word vectors together does not improve the results.

2.2 Effect of Filter Size

  •  Takeaway of the day: use vector graphics for figures in papers.

 

  • Combining several filter sizes near the single best size usually beats combining several copies of the best size (this still needs task-specific experimentation; Table 5 shows the opposite trend).

2.3 Effect of the Number of Filters

  • Experimental takeaway: use at least 100 filters when possible.

2.4 Effect of the Activation Function

2.5 Effect of the Dropout Rate

  •  In most settings, performance degrades once the dropout rate exceeds 0.7.

3. Paper Summary

 

4. Code Implementation

MR dataset: download from https://www.cs.cornell.edu/people/pabo/movie-review-data/
SST dataset: download from https://nlp.stanford.edu/sentiment/
word2vec (GoogleNews vectors): https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
Mirror (China): https://pan.baidu.com/s/1jJ9eAaE
GloVe: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Mirror (China): https://apache-mxnet.s3.cn-north-1.amazonaws.com.cn/gluon/embeddings/glove/glove.840B.300d.zip

Dataset Construction

 

# coding:utf-8
from torch.utils import data
import os
import random
import numpy as np
import nltk
import torch
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors


class MR_Dataset(data.Dataset):
    def __init__(self, state="train", k=0, embedding_type="word2vec"):

        self.path = os.path.abspath('.')
        if "data" not in self.path:
            self.path += "/data"
        # load the dataset
        pos_samples = open(self.path + "/MR/rt-polarity.pos", errors="ignore").readlines()  # read positive samples line by line
        neg_samples = open(self.path + "/MR/rt-polarity.neg", errors="ignore").readlines()
        datas = pos_samples + neg_samples
        # datas = [nltk.word_tokenize(data) for data in datas]
        datas = [data.split() for data in datas]  # split on whitespace: each sample (one sentence) becomes a list of tokens
        labels = [1] * len(pos_samples) + [0] * len(neg_samples)  # label 1 for positive, 0 for negative samples

        word2id = {"<pad>": 0}  # build word2id; id 0 is reserved for the padding token
        max_sample_length = max([len(sample) for sample in datas])  # longest sentence length; every sentence is padded to this length
        for i, data in enumerate(datas):
            for j, word in enumerate(data):
                if word not in word2id:  # assign the next id to any unseen word
                    word2id[word] = len(word2id)
                datas[i][j] = word2id[word]  # replace each token with its id
            datas[i] = datas[i] + [0] * (max_sample_length - len(datas[i]))  # pad every sentence to the maximum length
            # datas[i] = datas[i][0:max_sample_length] + [0] * (max_sample_length - len(datas[i]))  # variant that also truncates

        self.n_vocab = len(word2id)
        self.word2id = word2id
        if embedding_type == "word2vec":
            self.get_word2vec()
        elif embedding_type == "glove":
            self.get_glove_embedding()
        else:
            pass

        c = list(zip(datas, labels))  # shuffle the data and labels together
        random.seed(1)
        random.shuffle(c)
        datas[:], labels[:] = zip(*c)

        if state == "train":  # build the training split
            # 10-fold cross-validation: fold k is held out for testing, the rest form train + valid
            # (e.g. for k=3, the first slice keeps folds 0-2 and the second slice keeps folds 4-9)
            self.datas = datas[:int(k * len(datas) / 10)] + datas[int((k + 1) * len(datas) / 10):]
            self.labels = labels[:int(k * len(datas) / 10)] + labels[int((k + 1) * len(labels) / 10):]
            self.datas = np.array(self.datas[0:int(0.9 * len(self.datas))])  # first 90% of the remaining data is the training set
            self.labels = np.array(self.labels[0:int(0.9 * len(self.labels))])
        elif state == "valid":  # build the validation split
            self.datas = datas[:int(k * len(datas) / 10)] + datas[int((k + 1) * len(datas) / 10):]
            self.labels = labels[:int(k * len(datas) / 10)] + labels[int((k + 1) * len(labels) / 10):]
            self.datas = np.array(self.datas[int(0.9 * len(self.datas)):])   # last 10% of the remaining data is the validation set
            self.labels = np.array(self.labels[int(0.9 * len(self.labels)):])
        elif state == "test":  # build the test split
            # fold k is the test set
            self.datas = np.array(datas[int(k * len(datas) / 10):int((k + 1) * len(datas) / 10)])
            self.labels = np.array(labels[int(k * len(datas) / 10):int((k + 1) * len(datas) / 10)])

    def __getitem__(self, index):
        return self.datas[index], self.labels[index]

    def __len__(self):
        return len(self.datas)

    def get_glove_embedding(self):
        """
        Build the GloVe embedding matrix for the vocabulary.
        :return: sets self.weight to a (vocab_size, embed_size) matrix
        """
        if not os.path.exists(self.path + "/glove_embedding_mr.npy"):  # if the embedding matrix has not been cached yet, build it
            if not os.path.exists(self.path + "/test_word2vec.txt"):
                glove_file = datapath(self.path + '/glove.840B.300d.txt')
                # path of the file produced by converting GloVe to word2vec format
                tmp_file = get_tmpfile(self.path + "/glove_word2vec.txt")
                from gensim.scripts.glove2word2vec import glove2word2vec
                glove2word2vec(glove_file, tmp_file)
            else:
                tmp_file = get_tmpfile(self.path + "/glove_word2vec.txt")
            print("Reading Glove Embedding...")
            wvmodel = KeyedVectors.load_word2vec_format(tmp_file)

            tmp = []  # collect the vectors of all in-vocabulary words to estimate their mean and std
            for word, index in self.word2id.items():
                try:
                    tmp.append(wvmodel.get_vector(word))
                except KeyError:  # skip words missing from the pretrained vocabulary
                    pass
            mean = np.mean(np.array(tmp))
            std = np.std(np.array(tmp))
            print(mean, std)

            vocab_size = self.n_vocab  # vocabulary size
            embed_size = 300
            embedding_weights = np.random.normal(mean, std, [vocab_size, embed_size])  # initialize the matrix from a normal distribution
            for word, index in self.word2id.items():
                try:
                    embedding_weights[index, :] = wvmodel.get_vector(word)  # overwrite the row if the word has a pretrained vector
                except KeyError:
                    pass
            np.save(self.path + "/glove_embedding_mr.npy", embedding_weights)  # cache the embedding matrix
        else:
            embedding_weights = np.load(self.path + "/glove_embedding_mr.npy")  # load the cached embedding matrix
        self.weight = embedding_weights

    def get_word2vec(self):
        """
        Build the word2vec embedding matrix for the vocabulary.
        :return: sets self.weight to a (vocab_size, embed_size) matrix
        """
        if not os.path.exists(self.path + "/word2vec_embedding_mr.npy"):  # if the embedding matrix has not been cached yet, build it
            print("Reading word2vec Embedding...")
            wvmodel = KeyedVectors.load_word2vec_format(self.path + "/GoogleNews-vectors-negative300.bin.gz",
                                                        binary=True)
            tmp = []  # collect the vectors of all in-vocabulary words to estimate their mean and std
            for word, index in self.word2id.items():
                try:
                    tmp.append(wvmodel.get_vector(word))
                except KeyError:  # skip words missing from the pretrained vocabulary
                    pass
            mean = np.mean(np.array(tmp))
            std = np.std(np.array(tmp))
            print(mean, std)

            vocab_size = self.n_vocab
            embed_size = 300
            embedding_weights = np.random.normal(mean, std, [vocab_size, embed_size])  # initialize the matrix from a normal distribution
            for word, index in self.word2id.items():
                try:
                    embedding_weights[index, :] = wvmodel.get_vector(word)
                except KeyError:
                    pass
            np.save(self.path + "/word2vec_embedding_mr.npy", embedding_weights)  # cache the embedding matrix
        else:
            embedding_weights = np.load(self.path + "/word2vec_embedding_mr.npy")  # load the cached embedding matrix
        self.weight = embedding_weights


if __name__ == "__main__":
    mr_train_dataset = MR_Dataset()
    print(mr_train_dataset.__len__())
    print(mr_train_dataset[0])
    mr_valid_dataset = MR_Dataset("valid")
    print(mr_valid_dataset.__len__())
    print(mr_valid_dataset[0])
    mr_test_dataset = MR_Dataset("test")
    print(mr_test_dataset.__len__())
    print(mr_test_dataset[0])

Model Construction

# -*- coding: utf-8 -*-
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
from .BasicModule import BasicModule


class TextCNN(BasicModule):

    def __init__(self, config):
        super(TextCNN, self).__init__()
        # embedding layer
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)  # load pretrained word vectors
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed_size)
        # convolutional layers: one Conv1d per filter width, wrapped in an nn.ModuleList so their
        # parameters are registered with the module (a plain Python list would hide them from
        # model.parameters() and model.cuda())
        self.convs = nn.ModuleList([nn.Conv1d(config.embed_size, config.filter_num, filter_size)
                                    for filter_size in config.filters])
        # dropout layer
        self.dropout = nn.Dropout(config.dropout)
        # classification layer
        self.fc = nn.Linear(config.filter_num*len(config.filters), config.label_num)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x))
        # max-over-time pooling: pool over the whole remaining length
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        out = self.embedding(x)  # batch_size * sentence_length * embedding_size
        out = out.transpose(1, 2).contiguous()  # batch_size * embedding_size * sentence_length
        out = torch.cat([self.conv_and_pool(out, conv) for conv in self.convs], 1)  # batch_size * (filter_num * len(filters))
        out = self.dropout(out)
        out = self.fc(out)  # batch_size * label_num
        return out


if __name__ == '__main__':
    print('running the TextCNN...')
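
BasicModule is imported by the model above but not included in these notes. It is typically a thin nn.Module subclass that adds checkpoint save/load helpers; the sketch below is an assumed implementation consistent with how it is used here, not the original file.

# BasicModule.py -- assumed implementation (not from the original notes)
import time
import torch
import torch.nn as nn


class BasicModule(nn.Module):
    """nn.Module plus simple checkpoint save/load helpers."""

    def __init__(self):
        super(BasicModule, self).__init__()
        self.model_name = str(type(self))

    def load(self, path):
        # load parameters from a checkpoint file
        self.load_state_dict(torch.load(path))

    def save(self, path=None):
        # save parameters; default to a timestamped file name
        if path is None:
            path = time.strftime(self.model_name + "_%m%d_%H%M%S.pth")
        torch.save(self.state_dict(), path)
        return path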

Training & Testing

# -*- coding: utf-8 -*-
from pytorchtools import EarlyStopping
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import TextCNN
from data import MR_Dataset
import numpy as np
import config as argumentparser
config = argumentparser.ArgumentParser()
config.filters = list(map(int, config.filters.split(",")))  # parse the comma-separated filter widths into a list of ints
torch.manual_seed(config.seed)


if torch.cuda.is_available():
    torch.cuda.set_device(config.gpu)


def get_test_result(data_iter, data_set):
    # evaluate the (global) model on one split and return its total loss and accuracy
    model.eval()
    data_loss = 0
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        out = model(data)
        loss = criterion(out, autograd.Variable(label.long()))
        data_loss += loss.data.item()
        # torch.argmax(out, 1) gives the predicted class index for each sample
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())
    acc = true_sample_num / data_set.__len__()
    return data_loss, acc


acc = 0

for i in range(0, 10):  # 10-fold cross-validation: fold i is the test fold
    early_stopping = EarlyStopping(patience=10, verbose=True,cv_index=i)

    training_set = MR_Dataset(state="train",k=i,embedding_type=config.embedding_type)
    training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                                batch_size=config.batch_size,
                                                shuffle=True,
                                                num_workers=2)

    if config.use_pretrained_embed:
        config.embedding_pretrained = torch.from_numpy(training_set.weight).float()  # pretrained embedding matrix built by the dataset
    else:
        config.embedding_pretrained = None  # TextCNN checks "is not None", so use None rather than False

    config.n_vocab = training_set.n_vocab

    valid_set = MR_Dataset(state="valid", k=i,embedding_type="no")
    valid_iter = torch.utils.data.DataLoader(dataset=valid_set,
                                 batch_size=config.batch_size,
                                 shuffle=False,
                                 num_workers=2)
    test_set = MR_Dataset(state="test", k=i,embedding_type="no")

    test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                batch_size=config.batch_size,
                                shuffle=False,
                                num_workers=2)

    model = TextCNN(config)
    if config.cuda and torch.cuda.is_available():
        model.cuda()  # moves every registered parameter, including the copied pretrained embedding

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    count = 0
    loss_sum = 0
    for epoch in range(config.epoch):
        # train for one epoch
        model.train()
        for data, label in training_iter:
            if config.cuda and torch.cuda.is_available():
                data = data.cuda()
                label = label.cuda()
            else:
                data = torch.autograd.Variable(data).long()
            label = torch.autograd.Variable(label).squeeze()
            out = model(data)
            l2_loss = config.l2 * torch.sum(torch.pow(model.fc.weight, 2))  # L2 penalty on the fully connected layer's weight only
            loss = criterion(out, autograd.Variable(label.long())) + l2_loss
            loss_sum += loss.data.item()
            count += 1
            if count % 100 == 0:  # print the average loss every 100 iterations
                print("epoch", epoch, end='  ')
                print("The loss is: %.5f" % (loss_sum / 100))
                loss_sum = 0
                count = 0
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # save the model in every epoch
        # after each epoch, evaluate on the validation set
        valid_loss,valid_acc = get_test_result(valid_iter,valid_set)
        early_stopping(valid_loss, model)
        print ("The valid acc is: %.5f" % valid_acc)
        if early_stopping.early_stop:
            print("Early stopping")
            break
    # after training this fold, reload the best checkpoint and evaluate on the test fold
    model.load_state_dict(torch.load('./checkpoints/checkpoint%d.pt'%i))
    test_loss, test_acc = get_test_result(test_iter, test_set)
    print("The test acc is: %.5f" % test_acc)
    acc += test_acc / 10
# average test accuracy over the 10 folds
print("The average test acc is: %.5f" % acc)
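
Two helper modules used above are not included in these notes. EarlyStopping comes from the pytorchtools helper (Bjarten's early-stopping-pytorch), apparently modified to accept a cv_index argument so that each fold saves its own ./checkpoints/checkpoint{i}.pt. The config module (import config as argumentparser) defines the hyper-parameters; below is a minimal argparse-based sketch with the fields the scripts actually read. Field names are inferred from usage, and the default values are assumptions rather than the original settings.

# config.py -- minimal sketch; defaults are assumptions, not the original settings
import argparse


def ArgumentParser():
    parser = argparse.ArgumentParser(description="TextCNN on MR")
    parser.add_argument("--embedding_type", type=str, default="word2vec")  # word2vec / glove / no
    parser.add_argument("--use_pretrained_embed", type=int, default=1)     # 1 = load pretrained vectors
    parser.add_argument("--embed_size", type=int, default=300)
    parser.add_argument("--filter_num", type=int, default=100)
    parser.add_argument("--filters", type=str, default="3,4,5")            # parsed into ints in the training script
    parser.add_argument("--dropout", type=float, default=0.5)
    parser.add_argument("--label_num", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=50)
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--epoch", type=int, default=50)
    parser.add_argument("--l2", type=float, default=0.004)
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--cuda", type=int, default=1)                     # 1 = use GPU if available
    parser.add_argument("--gpu", type=int, default=0)
    # n_vocab and embedding_pretrained are filled in at runtime by the training script
    return parser.parse_args()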

 

This post is a study note from the 深度之眼 paper-reading course, written for my own learning; feel free to raise any issues for discussion!
