【NLP】【TextCNN】Text Classification


TextCNN

Must read: [Reference: Paper notes: Convolutional Neural Networks for Sentence Classification, using CNN for sentence classification - 小千同学超级爱写代码 - cnblogs]

[Reference: Applications of convolution in NLP, with TextCNN as an example - bilibili]

[Reference: Paper reading: Convolutional Neural Networks for Sentence Classification - 南有芙蕖 - CSDN blog]

[Reference: TextCNN Tianchi lecture - bilibili] Explains the topic very well (includes a PyTorch code walkthrough).
Companion code: [Reference: Datawhale beginner NLP competition, Task 5: Text classification with deep learning 2-2 TextCNN - Tianchi Lab]

The code can also be found by searching on [Reference: Tianchi Lab, an online collaborative data-analysis platform with free compute]


Paper 1

Reference paper 1: "Convolutional Neural Networks for Sentence Classification" (2014)
[Figure: the model architecture from the paper, with two convolution windows highlighted in red and yellow]
The red-line part of the figure has a window size of 2: the filter extracts features from two words at a time.

The yellow-line part has a window size of 3 and extracts features from three words at a time. In other words, the "window" is the number of words the filter acts on in one step; in the figure, it is the number of rows the filter covers at once.
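A minimal sketch of the window idea (hypothetical toy sizes, not taken from the paper): a 5-word sentence embedded in 4 dimensions, convolved with kernels of height 2 and 3. The output length is sequence_length - window + 1, confirming that a kernel of height k covers k words (rows) per step.

import torch
from torch import nn

sentence = torch.randn(1, 1, 5, 4)  # [batch=1, channel=1, 5 words, embedding_size=4]

conv2 = nn.Conv2d(1, 1, kernel_size=(2, 4))  # window of 2 words (the red line)
conv3 = nn.Conv2d(1, 1, kernel_size=(3, 4))  # window of 3 words (the yellow line)

print(conv2(sentence).shape)  # torch.Size([1, 1, 4, 1]): 5 - 2 + 1 = 4 positions
print(conv3(sentence).shape)  # torch.Size([1, 1, 3, 1]): 5 - 3 + 1 = 3 positions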

Paper 2

Reference paper 2: "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification" (2016)

A tuning guide for TextCNN (this is the variant in common use today).

[Figure: the architecture from the paper, with three kernel sizes shown in different colors]

Two kernels with window size 2 (yellow), two with window size 3 (green), and two with window size 4 (red). Each kernel's feature map is max-pooled to a single value, and the six values are concatenated into the feature vector fed to the classifier.
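The figure's pipeline can be sketched as follows (hypothetical toy sizes, not the paper's hyperparameters): two filters per window size, max-over-time pooling of each feature map, and concatenation into one fixed-length feature vector per sentence.

import torch
from torch import nn
import torch.nn.functional as F

batch, seq_len, embed = 1, 7, 5            # toy sizes
x = torch.randn(batch, 1, seq_len, embed)  # [batch, channel=1, words, embedding]

pooled = []
for window in (2, 3, 4):                   # the three kernel heights in the figure
    conv = nn.Conv2d(1, 2, kernel_size=(window, embed))  # 2 filters per size
    h = F.relu(conv(x))                                  # [batch, 2, seq_len-window+1, 1]
    p = F.max_pool2d(h, (seq_len - window + 1, 1))       # max over time -> [batch, 2, 1, 1]
    pooled.append(p.flatten(1))                          # [batch, 2]

features = torch.cat(pooled, dim=1)        # [batch, 6]: one value per filter
print(features.shape)                      # torch.Size([1, 6])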

PyTorch Implementation (Simplified)

[Reference: A PyTorch implementation of TextCNN - bilibili]
Companion article: https://wmathor.com/index.php/archives/1445/ (written in great detail)

import torch
import numpy as np
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F

dtype = torch.FloatTensor
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 3 words sentences (=sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.


# TextCNN hyperparameters
embedding_size = 2  # each word is represented by a 2-dimensional vector
sequence_length = len(sentences[0].split())  # 3; assumes all sentences are the same length (three words)
num_classes = len(set(labels))  # 2
batch_size = 3

word_list = " ".join(sentences).split()  # all words in sentences, duplicates included
vocab = list(set(word_list))  # vocabulary: the unique words
word2idx = {w: i for i, w in enumerate(vocab)}  # word -> index
vocab_size = len(vocab)

def make_data(sentences, labels):
    inputs = []
    for sentence in sentences:
        inputs.append([word2idx[n] for n in sentence.split()])  # convert each sentence to its index sequence

    targets = []
    for out in labels:
        targets.append(out)

    return inputs, targets

input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)

dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, shuffle=True)
input_batch
tensor([[ 4, 11, 14],
        [ 5,  9, 10],
        [15,  0, 12],
        [ 4,  6, 14],
        [ 1,  2, 13],
        [ 7,  3,  8]])
target_batch
tensor([1, 1, 1, 0, 0, 0])

from torch import nn


class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size)
        out_channels = 3
        self.conv = nn.Sequential(
            # conv: [input_channel(=1), output_channel, kernel_size=(filter_height, filter_width=embedding_size), stride=1]
            # out_channels=3 means three kernels are convolved with the input
            # filter_height=2 here, i.e. only one window size (the full model uses several)
            nn.Conv2d(in_channels=1, out_channels=out_channels, kernel_size=(2, embedding_size))
            # output: [batch_size, out_channels=3, sequence_length-2+1=2, 1]
            , nn.ReLU()
            # pool: (filter_height, filter_width)
            , nn.MaxPool2d(kernel_size=(2, 1))  # max-pool each 2x1 feature map down to 1x1
        )
        # fully connected classifier
        self.fc = nn.Linear(in_features=out_channels, out_features=num_classes)  # binary classification output

    def forward(self, x):
        '''
        x: [batch_size, sequence_length]
        '''
        batch_size = x.shape[0]  # number of sentences in the batch
        # Look up word vectors: an index such as 4 in [[4, 11, 14], ...] is replaced
        # by its embedding, e.g. [1, 2], turning the batch into a 3-D tensor
        embedding_x = self.W(x)  # [batch_size, sequence_length, embedding_size]
        # Insert a channel dimension of size 1 (like a grayscale image), because a
        # standard CNN expects input of shape [batch_size, in_channel, height, width]
        embedding_x = embedding_x.unsqueeze(1)  # [batch, channel(=1), sequence_length, embedding_size]
        conved = self.conv(embedding_x)  # [batch_size, output_channel, 1, 1]
        flatten = conved.view(batch_size, -1)  # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output
model=TextCNN().to(device=device)
criterion=nn.CrossEntropyLoss().to(device=device)
optimizer=optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x=batch_x.to(device=device)
        batch_y=batch_y.to(device=device)
        pred=model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Epoch: 1000 loss = 0.030200
Epoch: 1000 loss = 0.054546
Epoch: 2000 loss = 0.014919
Epoch: 2000 loss = 0.007824
Epoch: 3000 loss = 0.002666
Epoch: 3000 loss = 0.005158
Epoch: 4000 loss = 0.001931
Epoch: 4000 loss = 0.000988
Epoch: 5000 loss = 0.000379
Epoch: 5000 loss = 0.000743
# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)
# Predict
model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text,"is Bad Mean...")
else:
    print(test_text,"is Good Mean!!")
i hate me is Bad Mean...
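Note that the code above assumes every sentence has exactly sequence_length words. On real data, sentences would first be padded to a common length; a minimal sketch of that step (the pad index, the encode helper, and max_len below are illustrative additions, not part of the original code; the embedding layer would then need num_embeddings = vocab_size + 1):

pad_idx = len(word2idx)  # reserve a new index for padding (assumes the embedding table is enlarged by one)
max_len = 4              # hypothetical maximum length

def encode(sentence, max_len=max_len):
    idxs = [word2idx[w] for w in sentence.split()]
    return idxs[:max_len] + [pad_idx] * (max_len - len(idxs))  # truncate, then pad on the right

print(encode("i love you"))  # e.g. [4, 11, 14, 16]: the original indices plus one pad index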

PyTorch Implementation 2

[Reference: nlp-tutorial/TextCNN.py at master · graykode/nlp-tutorial]

# %%
# code by Tae Hwan Jung @graykode
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.num_filters_total = num_filters * len(filter_sizes)
        self.W = nn.Embedding(vocab_size, embedding_size)
        self.filter_list = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(size, embedding_size))
            for size in filter_sizes
        ])
        self.Weight = nn.Linear(self.num_filters_total, num_classes, bias=False)  # classification layer
        self.Bias = nn.Parameter(torch.ones([num_classes]))

    def forward(self, X):
        embedded_chars = self.W(X)  # [batch_size, sequence_length, embedding_size]
        # add a channel dimension of 1
        embedded_chars = embedded_chars.unsqueeze(1)  # [batch, channel(=1), sequence_length, embedding_size]

        pooled_outputs = []
        for i, conv in enumerate(self.filter_list):
            # conv: [input_channel(=1), output_channel(=3), (filter_height, filter_width), bias_option]
            h = F.relu(conv(embedded_chars))  # h: [batch_size(=6), output_channel(=3), output_height(=2), output_width(=1)]
            # max-over-time pooling window: (filter_height, filter_width)
            mp = nn.MaxPool2d((sequence_length - filter_sizes[i] + 1, 1))  # (2, 1) here
            # mp(h): [batch_size(=6), output_channel(=3), output_height(=1), output_width(=1)]
            # pooled: [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3)]
            pooled = mp(h).permute(0, 3, 2, 1)
            pooled_outputs.append(pooled)

        h_pool = torch.cat(pooled_outputs, len(filter_sizes))  # [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3) * 3]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total])  # [batch_size(=6), output_height * output_width * (output_channel * 3)]
        model = self.Weight(h_pool_flat) + self.Bias  # [batch_size, num_classes]
        return model

if __name__ == '__main__':
    embedding_size = 2 # embedding size
    sequence_length = 3 # sequence length
    num_classes = 2 # number of classes
    # the paper uses [2, 3, 4]; see the note after this listing
    filter_sizes = [2, 2, 2] # n-gram windows; each kernel has shape (filter_size, embedding_size)
    num_filters = 3 # number of filters per window size, so each Conv2d produces 3 output channels

    # 3 words sentences (=sequence_length is 3)
    sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
    labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

    word_list = " ".join(sentences).split()
    word_list = list(set(word_list))
    word_dict = {w: i for i, w in enumerate(word_list)}
    vocab_size = len(word_dict)

    model = TextCNN()

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    inputs = torch.LongTensor([np.asarray([word_dict[n] for n in sen.split()]) for sen in sentences])
    targets = torch.LongTensor([out for out in labels]) # class indices, not one-hot, as CrossEntropyLoss expects

    # Training
    for epoch in range(5000):
        optimizer.zero_grad()
        output = model(inputs)

        # output : [batch_size, num_classes], target_batch : [batch_size] (LongTensor, not one-hot)
        loss = criterion(output, targets)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

        loss.backward()
        optimizer.step()

    # Test
    test_text = 'sorry hate you'
    tests = [np.asarray([word_dict[n] for n in test_text.split()])]
    test_batch = torch.LongTensor(tests)

    # Predict
    predict = model(test_batch).data.max(1, keepdim=True)[1]
    if predict[0][0] == 0:
        print(test_text,"is Bad Mean...")
    else:
        print(test_text,"is Good Mean!!")
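A note on why filter_sizes is [2, 2, 2] rather than the paper's [2, 3, 4]: a kernel of height k needs at least k words, and these toy sentences have only 3. A quick sketch (toy shapes assumed) makes this concrete:

import torch
from torch import nn

x = torch.randn(1, 1, 3, 2)  # [batch, channel, sequence_length=3, embedding_size=2]
for k in (2, 3, 4):
    try:
        out = nn.Conv2d(1, 3, kernel_size=(k, 2))(x)
        print(k, '->', tuple(out.shape))  # output height = 3 - k + 1
    except RuntimeError:
        print(k, '-> kernel is taller than a 3-word sentence')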

PyTorch Implementation 3 (TBD)

[Reference: 手写AI (shouxieai): TextCNN text classification, a line-by-line reproduction - bilibili]
[Reference: A-series-of-NLP/文本分类/TextCNN_文本分类 at main · shouxieai/A-series-of-NLP]


Tianchi Code

[Reference: NLP-Baseline TextCNN - Tianchi Lab]

import torch
from torch import nn
import torch.nn.functional as F

class config:
    def __init__(self):
        self.embedding_pretrained = None  # pretrained word vectors (None = train from scratch)
        self.n_vocab = 100  # vocabulary size
        self.embed_size = 300  # word-vector dimension
        self.cuda = False  # whether to use the GPU
        self.filter_num = 100  # number of filters per kernel size
        self.filters = [3, 4, 5]  # kernel sizes
        self.label_num = 2  # number of classes
        self.dropout = 0.5  # dropout probability
        self.sentence_max_size = 50  # maximum sentence length

class TextCNN(BasicModule):  # BasicModule, defined in the Tianchi notebook, is a thin wrapper around nn.Module
    def __init__(self, config):
        super(TextCNN, self).__init__()
        # embedding layer
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed_size)
        # convolution layers: Conv1d treats the embedding dimension as the input channels
        self.conv1d_1 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[0])
        self.conv1d_2 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[1])
        self.conv1d_3 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[2])
        # pooling layers: window = sentence_max_size - kernel_size + 1 (max over time)
        self.Max_pool_1 = nn.MaxPool1d(config.sentence_max_size - config.filters[0] + 1)
        self.Max_pool_2 = nn.MaxPool1d(config.sentence_max_size - config.filters[1] + 1)
        self.Max_pool_3 = nn.MaxPool1d(config.sentence_max_size - config.filters[2] + 1)
        # dropout layer
        self.dropout = nn.Dropout(config.dropout)
        # classification layer
        self.fc = nn.Linear(config.filter_num * len(config.filters), config.label_num)

    def forward(self, x):
        x = x.long()
        out = self.embedding(x)  # [batch_size, length, embedding_size]
        out = out.transpose(1, 2).contiguous()  # [batch_size, embedding_size, length]
        x1 = F.relu(self.conv1d_1(out))
        x2 = F.relu(self.conv1d_2(out))
        x3 = F.relu(self.conv1d_3(out))
        x1 = self.Max_pool_1(x1).squeeze()  # [batch_size, filter_num]; note squeeze() assumes batch_size > 1
        x2 = self.Max_pool_2(x2).squeeze()
        x3 = self.Max_pool_3(x3).squeeze()
        print(x1.size(), x2.size(), x3.size())
        out = torch.cat([x1, x2, x3], 1)  # [batch_size, filter_num * 3]
        out = self.dropout(out)
        out = self.fc(out)
        return out
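A hedged usage sketch of the class above, assuming BasicModule behaves like a plain nn.Module (its definition is not shown in this excerpt) and feeding random word indices:

cfg = config()
model = TextCNN(cfg)
x = torch.randint(0, cfg.n_vocab, (8, cfg.sentence_max_size))  # [batch=8, sentence_max_size=50]
logits = model(x)
print(logits.shape)  # torch.Size([8, 2]): [batch_size, label_num]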