Code from the blog post: https://wmathor.com/index.php/archives/1445/
A Walkthrough of the TextCNN Code
TextCNN brings the CNN from computer vision into the text domain; it was proposed in the paper Convolutional Neural Networks for Sentence Classification.
Everything essential about TextCNN is captured in one figure (shown in the original blog post):
The red boxes on the left of the figure are the input samples (there are two boxes, which can be read as batch_size = 2). Take the sentence in one red box as an example: it contains 9 words, so the matrix has 9 rows, and each row is that word's embedding vector. The embedding dimension is user-defined; the figure has 6 columns, meaning each word is encoded as a vector of length 6.
Next, look at the yellow region inside the red box. This is a convolution kernel of size 3*6. Its height is user-defined: setting it to h means the kernel encodes h consecutive words at a time (so a CNN can, to some extent, integrate contextual information the way RNNs do). Its width is fixed to the embedding dimension, so each application of the kernel produces a single 1*1 cell. With the height set to 3 and a sliding stride of 1, moving the kernel down the 9-word red box produces 9 - 3 + 1 = 7 such cells. The four vertical strips in the middle of the figure mean output_channel = 4, exactly as in image CNNs. After convolution, pooling compresses the dimensions, and the result is flattened and fed into a fully connected layer for classification.
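To make that arithmetic concrete, here is a minimal, self-contained sketch using the figure's dimensions (9 words, 6-dimensional embeddings, a 3*6 kernel, 4 output channels; these are the figure's values, not the ones used in the code further below):

import torch
import torch.nn as nn

x = torch.randn(2, 1, 9, 6)                  # [batch_size=2, channel=1, 9 words, 6-dim embeddings]
conv = nn.Conv2d(1, 4, kernel_size=(3, 6))   # 4 kernels, each spanning 3 consecutive words
print(conv(x).shape)                         # torch.Size([2, 4, 7, 1]) -- 9 - 3 + 1 = 7 positions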
The dataset for this task is just a handful of hand-written sentences, each assigned a label, and the goal is text classification (sentiment classification).
Complete code
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F
dtype = torch.FloatTensor  # note: defined here but never used below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 3-word sentences (sequence_length = 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0] # 1 is good, 0 is not good.
# Model hyperparameters
embedding_size = 2 # dimension of the word embeddings
sequence_length = len(sentences[0].split()) # number of words per training sample (per sentence); 3 here
num_classes = len(set(labels)) # number of target classes
batch_size = 3
word_list = " ".join(sentences).split()
vocab = list(set(word_list))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(vocab)
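Note that set() does not guarantee a stable ordering across runs, so the word indices (and hence the tensors printed below) can differ each time the script runs. If reproducible indices matter, one small alternative (not in the original code) is to sort the vocabulary first:

vocab = sorted(set(word_list))  # fixed alphabetical order -> deterministic word2idx
word2idx = {w: i for i, w in enumerate(vocab)}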
# Data preprocessing
def make_data(sentences, labels):
    inputs = []
    for sen in sentences:
        inputs.append([word2idx[n] for n in sen.split()])  # each sentence -> a list of word indices
    targets = []
    for out in labels:
        targets.append(out)
    return inputs, targets
input_batch, target_batch = make_data(sentences, labels)
input_batch, target_batch = torch.LongTensor(input_batch), torch.LongTensor(target_batch)
dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, batch_size, shuffle=True)
print(input_batch)
print(target_batch)
"""
input_batch
tensor([[ 5, 6, 11],
[ 8, 10, 14],
[ 3, 15, 7],
[ 5, 13, 11],
[ 0, 2, 1],
[12, 4, 9]])
target_batch
tensor([1, 1, 1, 0, 0, 0])
"""
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.W = nn.Embedding(vocab_size, embedding_size)
        output_channel = 3
        self.conv = nn.Sequential(
            # conv: [input_channel(=1), output_channel, (filter_height, filter_width)], stride=1
            # filter height 2 means each kernel covers 2 consecutive words
            nn.Conv2d(1, output_channel, (2, embedding_size)),
            nn.ReLU(),
            # pool: (filter_height, filter_width); the conv output height is
            # sequence_length - 2 + 1 = 2, so a (2, 1) max-pool reduces it to 1
            nn.MaxPool2d((2, 1)),
        )
        # fully connected layer; output_channel here is shorthand for output_channel*1*1
        self.fc = nn.Linear(output_channel, num_classes)

    def forward(self, X):
        """
        X: [batch_size, sequence_length]
        """
        batch_size = X.shape[0]
        embedding_X = self.W(X)  # [batch_size, sequence_length, embedding_size]
        # add a channel dimension so the tensor fits Conv2d
        embedding_X = embedding_X.unsqueeze(1)  # [batch_size, channel(=1), sequence_length, embedding_size]
        conved = self.conv(embedding_X)  # [batch_size, output_channel, 1, 1]
        # flatten after conv/pooling before the fully connected layer
        flatten = conved.view(batch_size, -1)  # [batch_size, output_channel*1*1]
        output = self.fc(flatten)
        return output
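A quick shape trace with dummy data (not from the original post) confirms the sizes annotated in the comments above:

m = TextCNN()
dummy = torch.randint(0, vocab_size, (3, 3))   # [batch_size=3, sequence_length=3]
emb = m.W(dummy).unsqueeze(1)                  # [3, 1, 3, 2]
print(m.conv(emb).shape)                       # torch.Size([3, 3, 1, 1])
print(m(dummy).shape)                          # torch.Size([3, 2])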
model = TextCNN().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Training
for epoch in range(5000):
    for batch_x, batch_y in loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Test
test_text = 'i hate me'
tests = [[word2idx[n] for n in test_text.split()]]
test_batch = torch.LongTensor(tests).to(device)
# Predict
model.eval()
predict = model(test_batch).data.max(1, keepdim=True)[1]
if predict[0][0] == 0:
    print(test_text, "is Bad Mean...")
else:
    print(test_text, "is Good Mean!!")
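If class probabilities are wanted rather than just the argmax label, a small extension (softmax is a standard choice here; it is not part of the original post) would be:

with torch.no_grad():
    probs = F.softmax(model(test_batch), dim=1)
print(probs)  # e.g. tensor([[0.99..., 0.00...]]) -- exact values depend on training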
The output looks like this (each logged epoch prints two loss lines because the six samples are split into two batches of three):
tensor([[ 8, 0, 3],
[14, 15, 10],
[ 1, 2, 9],
[ 8, 4, 3],
[13, 12, 5],
[11, 7, 6]])
tensor([1, 1, 1, 0, 0, 0])
Epoch: 1000 loss = 0.005091
Epoch: 1000 loss = 0.099171
Epoch: 2000 loss = 0.000431
Epoch: 2000 loss = 0.029610
Epoch: 3000 loss = 0.000047
Epoch: 3000 loss = 0.010530
Epoch: 4000 loss = 0.000039
Epoch: 4000 loss = 0.004006
Epoch: 5000 loss = 0.001597
Epoch: 5000 loss = 0.000006
i hate me is Bad Mean...