Only 80 lines: possibly the simplest attention implementation you have ever seen
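This Markdown post walks through a binary text classification task in as few steps as possible. The goal is to show how an attention mechanism is implemented in practice. The model is a Bi-GRU with attention. To keep things simple, only a batch_size of 1 is considered: the final hidden states of the Bi-GRU's two directions (taken from the last layer) are concatenated and used as the query, and the outputs of all time steps are stacked into a matrix that serves as both key and value.
The theory behind Bi-GRU and attention is not covered here; there are many classic, easy-to-follow explanations online.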
[Figure: attention flow diagram]
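To make the query/key/value roles concrete before the PyTorch code, here is a minimal pure-Python sketch of dot-product attention as described above. The vectors are made-up toy numbers (an assumption for illustration, not values from the model): each row of `keys` stands in for one time step's output, and `query` stands in for the concatenated final hidden state. The helper names `softmax` and `dot_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(query, keys, values):
    """Score each key against the query, normalize, and mix the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy inputs (hypothetical numbers): 3 time steps, 2-dimensional vectors.
keys = values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]

weights, context = dot_attention(query, keys, values)
print(weights)   # the third step scores highest, so it gets the largest weight
print(context)   # weighted sum of the value rows
```

The weights always sum to 1, so the context vector is a weighted average of the per-step vectors. This is the same computation that the `attention` method in step 4 performs with `torch.mm` and `F.softmax`.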
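Step 1: import the packages, define the hyperparameters, and prepare the data. To keep things simple and easy to follow, the training set consists of just 6 short sentences labeled 0 or 1, where 0 means good and 1 means bad. Because the sample is so small, there is no validation or test set; whether the experiment works is judged by watching the model output and how the loss changes.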
import torchtext
import collections
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F
output_dim=2
hid_dim=128
emb_dim=50
lr=0.001
epoch=10
text=['shopping is fun','good luck', 'i fill sick','thank you','what a bad day', 'such a bad girl']
label=[0,0,1,0,1,1]
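Step 2: pack the data and tokenize the text. The tokenizer is the simplest option, basic_english, which lowercases the text, applies light punctuation normalization, and splits on whitespace. Each sentence is paired with its label in a tuple, and the tuples together form a list.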
train=list(zip(text,label))
tokenizer=torchtext.data.utils.get_tokenizer('basic_english')
train=[(tokenizer(each),label) for each,label in train]
print(train) should output:
[(['shopping', 'is', 'fun'], 0), (['good', 'luck'], 0), (['i', 'fill', 'sick'], 1), (['thank', 'you'], 0), (['what', 'a', 'bad', 'day'], 1), (['such', 'a', 'bad', 'girl'], 1)]
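If torchtext is not available, the same tokens can be produced for these particular sentences with plain Python, because they contain no punctuation. The sketch below is a stand-in under that assumption, not a reimplementation of `basic_english`; `simple_tokenizer` is a hypothetical helper name.

```python
text = ['shopping is fun', 'good luck', 'i fill sick',
        'thank you', 'what a bad day', 'such a bad girl']
label = [0, 0, 1, 0, 1, 1]

def simple_tokenizer(sentence):
    # Hypothetical helper: lowercase, then split on whitespace.
    return sentence.lower().split()

# Pair each tokenized sentence with its label.
train = [(simple_tokenizer(sen), lab) for sen, lab in zip(text, label)]
print(train[0])   # (['shopping', 'is', 'fun'], 0)
```

For text with punctuation or contractions the two tokenizers would diverge, so this shortcut only fits toy data like the six sentences here.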
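Step 3: map words to indices (word to index). Note that torchtext.vocab.Vocab(counter) is the counter-based constructor from older torchtext releases; newer releases changed this API.
def make_vocab(data):
    # Count how often each token occurs in the training set.
    vocab_count = collections.Counter()
    for sen, label in data:
        vocab_count.update(sen)
    # The counter-based Vocab adds the <unk> and <pad> special tokens by default.
    vocab = torchtext.vocab.Vocab(vocab_count)
    return vocab

vocab = make_vocab(train)
pad_idx = vocab['<pad>']
# Replace every token with its index in the vocabulary.
train2idx = [([vocab[word] for word in sen], label) for sen, label in train]
print(train2idx)
The output of print(train2idx) should look like this: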
[([12, 10, 6], 0), ([8, 11], 0), ([9, 5, 13], 1), ([15, 17], 0), ([16, 2, 3, 4], 1), ([14, 2, 3, 7], 1)]
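The index assignment above can be reproduced without torchtext. The sketch below assumes the ordering that the printed indices imply: two special tokens first (`<unk>` at 0, `<pad>` at 1), then words by descending frequency with ties broken alphabetically. `build_vocab` and `stoi` are hypothetical names used only for this illustration.

```python
import collections

train = [(['shopping', 'is', 'fun'], 0), (['good', 'luck'], 0),
         (['i', 'fill', 'sick'], 1), (['thank', 'you'], 0),
         (['what', 'a', 'bad', 'day'], 1), (['such', 'a', 'bad', 'girl'], 1)]

def build_vocab(data, specials=('<unk>', '<pad>')):
    counts = collections.Counter()
    for sen, _ in data:
        counts.update(sen)
    # Most frequent first; alphabetical order among equally frequent words.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(list(specials) + ordered)}

stoi = build_vocab(train)
unk_idx = stoi['<unk>']
# Words missing from the vocabulary fall back to the <unk> index.
train2idx = [([stoi.get(w, unk_idx) for w in sen], lab) for sen, lab in train]
print(train2idx)
```

Here `a` and `bad` each occur twice, so they take indices 2 and 3; every other word occurs once and is placed alphabetically after them.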
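Step 4: define the model, the optimizer, and the loss function. Note that parameter initialization is omitted here; for larger datasets it is worth adding, and the initialization code is given at the end of this post.
class mymodule(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim+1, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True)
        self.fc = nn.Linear(hid_dim*2, output_dim)

    def attention(self, output, hidden):
        # output: (seq_len, 2*hid_dim), hidden: (1, 2*hid_dim)
        key = value = output
        query = hidden.T
        # One score per time step, normalized over the sequence: (1, seq_len)
        alpha = F.softmax(torch.mm(key, query), dim=0).T
        # Weighted sum of the per-step outputs: (1, 2*hid_dim)
        atten_out = torch.mm(alpha, value)
        return atten_out

    def forward(self, text):
        embedded = self.embedding(text)        # (seq_len, emb_dim)
        embedded = embedded.unsqueeze(1)       # (seq_len, 1, emb_dim), batch_size = 1
        output, hidden = self.gru(embedded)    # (seq_len, 1, 2*hid_dim), (2, 1, hid_dim)
        output = output.squeeze(1)             # (seq_len, 2*hid_dim)
        # Concatenate the final forward and backward hidden states to form the query.
        hidden = torch.cat((hidden[0], hidden[1]), dim=1)    # (1, 2*hid_dim)
        att_out = self.attention(output, hidden)
        prediction = self.fc(att_out)          # (1, output_dim)
        return prediction

input_dim = len(vocab)
model = mymodule(input_dim, emb_dim, hid_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()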
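Since the attention code depends entirely on tensor shapes, it can help to trace them once. The sketch below is a standalone shape check, assuming a 3-token sentence and the hyperparameters from step 1. It uses a freshly created GRU and random input rather than the model above, so the values themselves are meaningless; only the shapes matter.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

emb_dim, hid_dim, seq_len = 50, 128, 3

gru = nn.GRU(emb_dim, hid_dim, bidirectional=True)
embedded = torch.randn(seq_len, 1, emb_dim)        # (seq_len, batch=1, emb_dim)

output, hidden = gru(embedded)
print(output.shape)    # torch.Size([3, 1, 256]): both directions at every step
print(hidden.shape)    # torch.Size([2, 1, 128]): final state of each direction

keys = output.squeeze(1)                               # (seq_len, 2*hid_dim)
query = torch.cat((hidden[0], hidden[1]), dim=1)       # (1, 2*hid_dim)

alpha = F.softmax(torch.mm(keys, query.T), dim=0).T    # (1, seq_len)
context = torch.mm(alpha, keys)                        # (1, 2*hid_dim)
print(alpha.shape, context.shape)
```

The softmax is taken over dim 0 because, before the transpose, each row of the score matrix corresponds to one time step; the weights therefore sum to 1 across the sentence.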
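Step 5: train the model:
for i in range(epoch):
    loss_list = []
    for inp, lab in train2idx:
        inp = torch.tensor(inp)
        lab = torch.tensor([lab])
        # Clear gradients left over from the previous sample.
        optimizer.zero_grad()
        prediction = model(inp)
        loss = criterion(prediction, lab)
        loss_list.append(float(loss))
        loss.backward()
        optimizer.step()
    # Mean loss over the samples in this epoch.
    print(sum(loss_list)/len(loss_list))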
The printed loss drops steadily from epoch to epoch, which shows that the model is fitting the training data.
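One detail of a loop like this is worth seeing in isolation: PyTorch adds each new gradient to whatever is already stored in `.grad`, so training loops normally clear the gradients before every backward pass. The toy parameter `w` below is purely illustrative and not part of the model.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()          # d(3w)/dw = 3

(3 * w).sum().backward()        # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()    # 3 + 3 = 6

w.grad.zero_()                  # clear the stored gradient, as optimizer.zero_grad() does for all parameters
(3 * w).sum().backward()
after_reset = w.grad.clone()    # back to 3

print(first.item(), accumulated.item(), after_reset.item())   # 3.0 6.0 3.0
```

Without the reset, each update would use the sum of all gradients seen so far rather than the gradient of the current sample.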
Step 6: run a simple prediction check:
test_sen = 'shopping is fun'
word = tokenizer(test_sen)
test2idx = torch.tensor([vocab[each] for each in word])
prediction_test = model(test2idx)
# argmax over the class dimension: 0 means good, 1 means bad.
if int(prediction_test.argmax(1)) == 0:
    print('good')
if int(prediction_test.argmax(1)) == 1:
    print('bad')
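The final loss values and the prediction result are shown below (the exact numbers vary from run to run because the weights are randomly initialized):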
0.7234548230965933
0.41153744608163834
0.18003544583916664
0.05795638697842757
0.016811293084174395
0.0047966171041480266
0.0012803817735402845
0.00031114658713704557
6.905176704918858e-05
1.386754926002709e-05
good
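A quick sanity check on these numbers: `nn.CrossEntropyLoss` computes the negative log of the softmax probability assigned to the true class. With two classes, a model that has learned nothing yet assigns roughly 0.5 to each, giving a loss near ln 2 ≈ 0.693, which is close to the first-epoch value above. The sketch below works this out in plain Python; the logits are hypothetical examples, and `cross_entropy` is a helper written for this illustration.

```python
import math

def cross_entropy(logits, target):
    """-log(softmax(logits)[target]) for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

undecided = cross_entropy([0.0, 0.0], 0)        # equal logits: ln 2
confident_right = cross_entropy([6.0, -6.0], 0) # large margin, correct class: near 0
confident_wrong = cross_entropy([6.0, -6.0], 1) # large margin, wrong class: about 12

print(undecided, confident_right, confident_wrong)
```

Losses around 1e-5 in the last epochs therefore mean the model assigns almost all probability to the correct label for every training sentence, which is expected when six sentences are memorized.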
Appendix: model initialization function
def initialize_parameters(m):
    if isinstance(m, nn.Embedding):
        nn.init.uniform_(m.weight, -0.05, 0.05)
    elif isinstance(m, nn.GRU):
        # Each GRU parameter stacks three gates (reset, update, new); initialize each chunk separately.
        for name, param in m.named_parameters():
            if 'weight_ih' in name:
                r, z, n = param.chunk(3)
                nn.init.xavier_uniform_(r)
                nn.init.xavier_uniform_(z)
                nn.init.xavier_uniform_(n)
            elif 'weight_hh' in name:
                r, z, n = param.chunk(3)
                nn.init.orthogonal_(r)
                nn.init.orthogonal_(z)
                nn.init.orthogonal_(n)
            elif 'bias' in name:
                r, z, n = param.chunk(3)
                nn.init.zeros_(r)
                nn.init.zeros_(z)
                nn.init.zeros_(n)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# Call this after creating the model and before training.
model.apply(initialize_parameters)
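To see what the GRU branch of this function does, the sketch below applies the same chunk-wise orthogonal initialization to the hidden-to-hidden weight of a small standalone GRU and checks the result. The sizes are chosen for illustration, and the whole block is wrapped in `torch.no_grad()` as a precaution added by this sketch. Because `chunk` returns views, the in-place init writes straight into the parameter.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=8)
weight_hh = gru.weight_hh_l0                 # shape (3 * hidden_size, hidden_size)

with torch.no_grad():
    # One (8, 8) block per gate, in PyTorch's order: reset, update, new.
    for block in weight_hh.chunk(3):
        nn.init.orthogonal_(block)

    # Each block should now satisfy W @ W.T == I up to float error.
    identity = torch.eye(8)
    checks = [torch.allclose(b @ b.T, identity, atol=1e-5)
              for b in weight_hh.chunk(3)]

print(weight_hh.shape, checks)
```

Initializing each gate's block separately keeps every gate's recurrent matrix orthogonal on its own, which would not hold if the stacked (24, 8) matrix were initialized as a whole.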