Previously, you implemented a vanilla RNN and applied it to the image captioning task. In this notebook, you will implement key pieces of a Transformer decoder to accomplish the same task.
As before, we start with the setup code and load the COCO data.
In [1]:
# Setup cell.
import time, os, json
import numpy as np
import matplotlib.pyplot as plt
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.transformer_layers import *
from cs231n.captioning_solver_transformer import CaptioningSolverTransformer
from cs231n.classifiers.transformer import CaptioningTransformer
from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from cs231n.image_utils import image_from_url
#%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # Set default size of plots.
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
In [2]:
# Load COCO data from disk into a dictionary.
data = load_coco_data(pca_features=True)
# Print out all the keys and values from the data dictionary.
for k, v in data.items():
    if type(v) == np.ndarray:
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))
Transformer: Multi-Headed Attention
For the same text, a single attention head produces one representation space; with multiple attention heads we can obtain several different representation spaces.

Multi-head attention gives attention multiple "representation subspaces": each head uses its own Query/Key/Value weight matrices, each randomly initialized, and through training the word embeddings are projected into different representation subspaces.

In the transformer we perform self-attention, which means that the values, keys, and queries are all derived from the input $X \in \mathbb{R}^{\ell \times d}$, where $\ell$ is our sequence length. Specifically, we learn parameter matrices $V, K, Q \in \mathbb{R}^{d \times d}$ to map our input $X$ as follows:

$$\text{keys} = XK, \qquad \text{queries} = XQ, \qquad \text{values} = XV,$$

which gives the simple dot-product attention $Y = \text{softmax}\big((XQ)(XK)^\top\big)(XV)$.

In the case of multi-headed attention, we learn a parameter matrix for each head, which makes the model more expressive in attending to different parts of the input. Let $h$ be the number of heads and $Y_i$ be the attention output of $\text{head}_i$, so we learn individual matrices $Q_i$, $K_i$, and $V_i$. To keep the overall computation the same as the single-headed case, we choose $Q_i \in \mathbb{R}^{d \times d/h}$, $K_i \in \mathbb{R}^{d \times d/h}$, and $V_i \in \mathbb{R}^{d \times d/h}$. Adding a scaling term $\frac{1}{\sqrt{d/h}}$ to the simple dot product above, we get

$$Y_i = \text{softmax}\left(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\right)(XV_i),$$

where $Y_i \in \mathbb{R}^{\ell \times d/h}$ and $\ell$ is the sequence length.

In our implementation, we apply dropout to the attention weights at this point (though in practice it could be used at any step):

$$Y_i = \text{dropout}\left(\text{softmax}\left(\frac{(XQ_i)(XK_i)^\top}{\sqrt{d/h}}\right)\right)(XV_i).$$

Finally, the output of self-attention is a linear transformation of the concatenation of the heads:

$$Y = [Y_1; \dots; Y_h]A, \qquad A \in \mathbb{R}^{d \times d},\ \ [Y_1; \dots; Y_h] \in \mathbb{R}^{\ell \times d}.$$
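To make the shapes concrete, here is a minimal sketch of this computation in plain PyTorch. The tensor names and the small toy sizes are illustrative only and are not part of the assignment code:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
ell, d, h = 5, 8, 2           # sequence length l, embedding dim d, number of heads h
X = torch.randn(ell, d)       # one input sequence X in R^{l x d}

A = torch.randn(d, d)         # final output projection A
heads = []
for i in range(h):
    # Per-head projections Q_i, K_i, V_i, each of shape (d, d/h).
    Qi, Ki, Vi = torch.randn(d, d // h), torch.randn(d, d // h), torch.randn(d, d // h)
    scores = (X @ Qi) @ (X @ Ki).T / ((d / h) ** 0.5)         # (l, l) scaled dot products
    weights = F.dropout(F.softmax(scores, dim=-1), p=0.1)     # attention weights with dropout
    heads.append(weights @ (X @ Vi))                          # Y_i, shape (l, d/h)

Y = torch.cat(heads, dim=-1) @ A                              # [Y_1; ...; Y_h] A, shape (l, d)
print(Y.shape)                                                # torch.Size([5, 8])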
Complete MultiHeadAttention in cs231n/transformer_layers.py.
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

    Usage:
      attn = MultiHeadAttention(embed_dim, num_heads=2)

      # self-attention
      data = torch.randn(batch_size, sequence_length, embed_dim)
      self_attn_output = attn(query=data, key=data, value=data)

      # attention using two inputs
      other_data = torch.randn(batch_size, sequence_length, embed_dim)
      attn_output = attn(query=data, key=other_data, value=other_data)
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
        - embed_dim: Dimension of the token embedding
        - num_heads: Number of attention heads
        - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # We will initialize these layers for you, since swapping the ordering
        # would affect the random number generation (and therefore your exact
        # outputs relative to the autograder). Note that the layers use a bias
        # term, but this isn't strictly necessary (and varies by
        # implementation).
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
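One possible way to finish the layer, shown as a sketch rather than the reference solution: register a dropout module and the head bookkeeping at the end of __init__, then implement forward by splitting the projected tensors into heads, applying the masked, scaled softmax attention with dropout from the equations above, and merging the heads before self.proj. The attribute names (n_head, head_dim, attn_drop), the attn_mask convention, and the exact reshapes below are assumptions of this sketch:

        # --- sketch: extra state assumed by the forward pass below; append to __init__ ---
        self.attn_drop = nn.Dropout(dropout)        # dropout applied to the attention weights
        self.n_head = num_heads                     # number of heads h
        self.head_dim = embed_dim // num_heads      # per-head dimension d/h

    def forward(self, query, key, value, attn_mask=None):
        """
        Sketch of a forward pass implementing the equations above.
        - query: (N, S, E); key and value: (N, T, E)
        - attn_mask: optional (S, T) tensor; positions where it equals 0 are masked out
        Returns an (N, S, E) tensor.
        """
        N, S, E = query.shape
        T = key.shape[1]
        # Project, then split into heads: (N, n_head, seq_len, head_dim).
        q = self.query(query).view(N, S, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(key).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(value).view(N, T, self.n_head, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores (XQ_i)(XK_i)^T / sqrt(d/h): (N, n_head, S, T).
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 0, float('-inf'))
        # Softmax over the key dimension, then dropout on the attention weights.
        weights = self.attn_drop(torch.softmax(scores, dim=-1))
        # Weighted sum of values, merge heads back to (N, S, E), apply output projection A.
        out = (weights @ v).transpose(1, 2).reshape(N, S, E)
        return self.proj(out)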