Previously, you implemented a vanilla RNN and applied it to the image captioning task. In this notebook, you will implement key pieces of a Transformer decoder to accomplish the same task.
As before, we start with the setup code and load the COCO data.
In [1]:
# Setup cell.
import time, os, json
import numpy as np
import matplotlib.pyplot as plt
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.transformer_layers import *
from cs231n.captioning_solver_transformer import CaptioningSolverTransformer
from cs231n.classifiers.transformer import CaptioningTransformer
from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from cs231n.image_utils import image_from_url
#%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # Set default size of plots.
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
In [2]:
# Load COCO data from disk into a dictionary.
data = load_coco_data(pca_features=True)
# Print out all the keys and values from the data dictionary.
for k, v in data.items():
    if type(v) == np.ndarray:
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))
Transformer: Multi-Headed Attention
For the same text, a single attention head produces one representation space; with multiple attention heads we can obtain several different representation spaces.

Multi-head attention gives attention multiple "representation subspaces": each head uses its own Query/Key/Value weight matrices, each randomly initialized, and through training the word embeddings are projected into different representation subspaces.

In the transformer we perform self-attention, which means that the values, keys, and queries are all derived from the input $X \in \mathbb{R}^{\ell \times d}$, where $\ell$ is our sequence length. Specifically, we learn parameter matrices $V, K, Q \in \mathbb{R}^{d \times d}$ to map our input $X$ as follows:

$$\text{keys} = XK, \qquad \text{queries} = XQ, \qquad \text{values} = XV,$$

which gives the simple dot-product attention $Y = \text{softmax}\big((XQ)(XK)^\top\big)(XV)$.

In the case of multi-headed attention, we learn a parameter matrix for each head, which makes the model more expressive in attending to different parts of the input. Let $h$ be the number of heads and $Y_i$ be the attention output of $\text{head}_i$, so we learn individual matrices $Q_i$, $K_i$, and $V_i$. To keep the overall computation the same as the single-headed case, we choose $Q_i \in \mathbb{R}^{d \times d/h}$, $K_i \in \mathbb{R}^{d \times d/h}$, and $V_i \in \mathbb{R}^{d \times d/h}$. Adding a scaling term $\frac{1}{\sqrt{d/h}}$ to the simple dot product above, we get

$$Y_i = \text{softmax}\left(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\right)(XV_i),$$

where $Y_i \in \mathbb{R}^{\ell \times d/h}$ and $\ell$ is the sequence length.

In our implementation, we apply dropout to the attention weights at this point (though in practice it could be used at any step):

$$Y_i = \text{dropout}\left(\text{softmax}\left(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\right)\right)(XV_i).$$

Finally, the output of self-attention is a linear transformation of the concatenation of the heads:

$$Y = [Y_1; \dots; Y_h]A, \qquad A \in \mathbb{R}^{d \times d},\ \ [Y_1; \dots; Y_h] \in \mathbb{R}^{\ell \times d}.$$
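To make the shapes concrete, here is a minimal sketch of this computation in plain PyTorch. The tensor names and the small toy sizes are illustrative only and are not part of the assignment code:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
ell, d, h = 5, 8, 2           # sequence length l, embedding dim d, number of heads h
X = torch.randn(ell, d)       # one input sequence X in R^{l x d}

A = torch.randn(d, d)         # final output projection A
heads = []
for i in range(h):
    # Per-head projections Q_i, K_i, V_i, each of shape (d, d/h).
    Qi, Ki, Vi = torch.randn(d, d // h), torch.randn(d, d // h), torch.randn(d, d // h)
    scores = (X @ Qi) @ (X @ Ki).T / ((d / h) ** 0.5)         # (l, l) scaled dot products
    weights = F.dropout(F.softmax(scores, dim=-1), p=0.1)     # attention weights with dropout
    heads.append(weights @ (X @ Vi))                          # Y_i, shape (l, d/h)

Y = torch.cat(heads, dim=-1) @ A                              # [Y_1; ...; Y_h] A, shape (l, d)
print(Y.shape)                                                # torch.Size([5, 8])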
Complete MultiHeadAttention in cs231n/transformer_layers.py.
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

    Usage:
      attn = MultiHeadAttention(embed_dim, num_heads=2)

      # self-attention
      data = torch.randn(batch_size, sequence_length, embed_dim)
      self_attn_output = attn(query=data, key=data, value=data)

      # attention using two inputs
      other_data = torch.randn(batch_size, sequence_length, embed_dim)
      attn_output = attn(query=data, key=other_data, value=other_data)
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
        - embed_dim: Dimension of the token embedding
        - num_heads: Number of attention heads
        - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # We will initialize these layers for you, since swapping the ordering
        # would affect the random number generation (and therefore your exact
        # outputs relative to the autograder). Note that the layers use a bias
        # term, but this isn't strictly necessary (and varies by
        # implementation).
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
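One possible way to finish the layer, shown as a sketch rather than the reference solution: register a dropout module and the head bookkeeping at the end of __init__, then implement forward by splitting the projected tensors into heads, applying the masked, scaled softmax attention with dropout from the equations above, and merging the heads before self.proj. The attribute names (n_head, head_dim, attn_drop), the attn_mask convention, and the exact reshapes below are assumptions of this sketch:

        # --- sketch: extra state assumed by the forward pass below; append to __init__ ---
        self.attn_drop = nn.Dropout(dropout)        # dropout applied to the attention weights
        self.n_head = num_heads                     # number of heads h
        self.head_dim = embed_dim // num_heads      # per-head dimension d/h

    def forward(self, query, key, value, attn_mask=None):
        """
        Sketch of a forward pass implementing the equations above.
        - query: (N, S, E); key and value: (N, T, E)
        - attn_mask: optional (S, T) tensor; positions where it equals 0 are masked out
        Returns an (N, S, E) tensor.
        """
        N, S, E = query.shape
        T = key.shape[1]
        # Project, then split into heads: (N, n_head, seq_len, head_dim).
        q = self.query(query).view(N, S, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(key).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(value).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores (XQ_i)(XK_i)^T / sqrt(d/h): (N, n_head, S, T).
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 0, float('-inf'))
        # Softmax over the key dimension, then dropout on the attention weights.
        weights = self.attn_drop(torch.softmax(scores, dim=-1))
        # Weighted sum of values, merge heads back to (N, S, E), apply output projection A.
        out = (weights @ v).transpose(1, 2).reshape(N, S, E)
        return self.proj(out)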