Pix2Seq: Algorithm Reading Notes

Table of Contents

Forward Pass

Tokenizer

Training

Data Processing

Network Inputs and Labels

Loss Computation

Network Architecture


Forward Pass

batch_preds --> tgt --> tgt = cat(tgt, padding) --> tgt_embedding
                                                --> tgt_mask, tgt_padding_mask
The image is fed into the Encoder.

1. batch_preds has shape (batch size, 1) and is filled with the start (BOS) token, 404.

2. tgt has shape (batch size, 299): index 0 holds the BOS token 404, and every remaining position holds the padding token 406. The length 299 comes from max_length - 1.

3. tgt_embedding is the word-embedding tensor with shape (batch size, length, embedding dim), where length is the sequence length and embedding dim is the embedding dimension.

tgt_mask masks tgt; its usual shape is (seq len, seq len). tgt_padding_mask masks the padding tokens in tgt; its usual shape is (batch_size, seq len).

4. The image passes through the Encoder; the output shape is (batch_size, num patches, embedding dim), where num patches is the number of image patches.

For the token-to-vector embedding conversion, see here.

To use language translation as an analogy, the input image is, as the paper puts it, a kind of "dialect", and the goal is to generate from it the "language" that expresses bounding-box coordinates and classes.

1. The Transformer output is fed into a linear layer, mapping the prediction from (batch size, seq len, embedding dim) to (batch size, seq len, vocab size), i.e., a distribution over the vocabulary at each position.

2. At each step, only the word vector at index length - 1 along the sequence dimension is taken. Note that this is autoregressive: length grows by 1 on every loop iteration, so the index advances step by step, meaning positions are predicted in order.

3. Greedy (maximum-probability) sampling is used to obtain the token from the predicted distribution.

4. The predicted token is appended to batch_preds, and the loop continues.

5. The loop runs max_len times in total. Here max_len is the maximum length of the predicted sequence, not of the input sequence; the two are different.

6. When emitting the predicted label token, preds is first passed through softmax, which yields a confidence score for the predicted class.
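The inference loop in steps 1–6 can be sketched as follows. This is a minimal sketch, not the repository's exact code: `encode` and `decode` are assumed callables standing in for the model's encoder and decoder, and the fixed-length padding of tgt to 299 (shown further below) is omitted for brevity.

```python
import torch

def greedy_decode(encode, decode, image, max_len=101, bos_idx=404):
    """Greedy autoregressive decoding sketch; encode/decode are assumed callables."""
    # start with one BOS token per image in the batch: shape (B, 1)
    batch_preds = torch.full((image.size(0), 1), bos_idx, dtype=torch.long)
    with torch.no_grad():
        memory = encode(image)                        # (B, num_patches, d_model)
        for _ in range(max_len):
            logits = decode(batch_preds, memory)      # (B, cur_len, vocab_size)
            probs = logits[:, -1, :].softmax(dim=-1)  # only the newest position
            token = probs.argmax(dim=-1)              # greedy; probs[b, token[b]] is the confidence
            batch_preds = torch.cat([batch_preds, token[:, None]], dim=1)
    return batch_preds
```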

The decoder output has shape (1, 299, 256).

---->> The 1 is the batch dimension: how many images are processed at once, analogous to processing several sentences at once in NLP.

          The 299 is the maximum length of the predicted sequence, analogous to the maximum number of words in a predicted sentence in NLP. This maximum is set at training time; inference does not use all of it — the code here runs only 101 decoding steps. Since 5 tokens are needed per bbox, at most 20 objects can be predicted. The 101 is freely configurable.

    with torch.no_grad():
        for i in range(max_len):  # max_len = CFG.generation_steps

Note that during autoregressive prediction, the image input stays the same at every step; what changes is batch_preds. Updating batch_preds in turn updates padding:

length = tgt.size(1)
# padding shrinks as length grows
padding = torch.ones(tgt.size(0), CFG.max_len-length-1).fill_(CFG.pad_idx).long().to(tgt.device)
tgt = torch.cat([tgt, padding], dim=1)  # total length stays fixed at (1, 299)

This in turn updates tgt_padding_mask. tgt_mask itself never changes: it is a causal (lower-triangular) mask.

tgt_mask

tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0.,  ..., 0., -inf, -inf],
        [0., 0., 0.,  ..., 0., 0., -inf],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
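The constant causal mask printed above can be generated with a one-liner; this is a generic sketch, not necessarily the repository's helper:

```python
import torch

def causal_mask(sz: int) -> torch.Tensor:
    # -inf above the diagonal blocks attention to future tokens; 0 elsewhere
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
```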

Post-processing the Output

1. First find the index of the first end (EOS) token, then check whether index - 1 is a multiple of 5. The reasoning: if both the start and end tokens are counted in the predicted sequence length, tokens come in groups of five (bbox + class); since indices start at 0, subtracting 1 from the EOS index is equivalent to dropping the BOS and EOS tokens from the sequence length.

2. Then dequantize to obtain the final predictions.

Tokenizer

Predictions are generated autoregressively; the result after the forward pass is visualized below.

The 404 there comes from

num_bins + num_classes

After discretization the vocabulary actually contains 407 tokens, because the start (404), end (405), and padding (406) tokens are added.

With the network's predictions above in hand, they are processed as follows.

1. Find the index of the first EOS token.
2. Check whether index - 1 is a multiple of 5; if not, this prediction is discarded and treated as no detections.
3. Strip the extra padding noise.
4. Iterate, taking 5 tokens at a time.
5. The first 4 values are the box; the 5th is the class.
6. Subtract num_bins from the discretized class token to recover the class index.
7. Dequantize the box: box / (num_bins - 1) gives normalized coordinates at the output feature scale.
8. Rescale the box to the input-image scale; the box format is (Xmin, Ymin, Xmax, Ymax).
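Steps 1–8 can be sketched as follows, assuming num_bins = 384 and EOS = 405 as used in these notes; `decode_predictions` is a hypothetical helper, and its input is the predicted sequence without the BOS token:

```python
def decode_predictions(tokens, width, height, num_bins=384, eos_idx=405):
    """Turn a predicted token sequence (without BOS) into boxes and labels."""
    tokens = list(tokens)
    if eos_idx in tokens:
        tokens = tokens[:tokens.index(eos_idx)]   # cut at the first EOS
    if len(tokens) % 5 != 0:                      # incomplete groups -> no detections
        return [], []
    boxes, labels = [], []
    for i in range(0, len(tokens), 5):
        xmin, ymin, xmax, ymax, cls = tokens[i:i + 5]
        labels.append(cls - num_bins)             # class token -> class index
        # dequantize: token / (num_bins - 1) gives normalized coords,
        # then rescale to the input-image size
        boxes.append([xmin / (num_bins - 1) * width,
                      ymin / (num_bins - 1) * height,
                      xmax / (num_bins - 1) * width,
                      ymax / (num_bins - 1) * height])
    return boxes, labels
```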

Training

Data Processing

df

classes holds the category names; there are 20 classes in total:

['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

Sample the data.

sample holds the object labels for one image; here, for example, it contains 8 objects.

Apply augmentation, i.e., transformed.

Next, build the input sample.

1. Quantize the bboxes:

\frac{x}{width}*(num\_bins-1)

The y coordinates are handled the same way, except divided by height. Labels are quantized directly as

labels+num\_bins

2. Pair the labels with the quantized bboxes and randomly shuffle their order.

3. Build the token sequence:

# bbox as a list
[303, 180, 364, 304]

# append the class label token
[303, 180, 364, 304, 399]

# prepend the start (BOS) token
[404, 303, 180, 364, 304, 399]

# repeat for every object; the final result looks like
[404, 303, 180, 364, 304, 399, 197, 187, 251, 271, 399, 34, 172, 65, 217, 399, 94, 152, 147, 253, 399, 62, 160, 100, 247, 399, 247, 151, 306, 285, 399, 4, 156, 47, 213, 399, 137, 128, 196, 260, 399]

# finally append the end (EOS) token
[404, 303, 180, 364, 304, 399, 197, 187, 251, 271, 399, 34, 172, 65, 217, 399, 94, 152, 147, 253, 399, 62, 160, 100, 247, 399, 247, 151, 306, 285, 399, 4, 156, 47, 213, 399, 137, 128, 196, 260, 399, 405]

At this point, the token sequence for a single image is complete.
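Steps 1–3 can be sketched as follows; `build_token_seq` is a hypothetical helper using the token values assumed in these notes (num_bins = 384, BOS = 404, EOS = 405):

```python
import random

def build_token_seq(bboxes, labels, width, height,
                    num_bins=384, bos_idx=404, eos_idx=405, shuffle=True):
    """Quantize boxes, pair them with label tokens, and wrap with BOS/EOS."""
    pairs = []
    for (xmin, ymin, xmax, ymax), label in zip(bboxes, labels):
        quant = [round(xmin / width * (num_bins - 1)),
                 round(ymin / height * (num_bins - 1)),
                 round(xmax / width * (num_bins - 1)),
                 round(ymax / height * (num_bins - 1))]
        pairs.append(quant + [label + num_bins])  # class token = label + num_bins
    if shuffle:
        random.shuffle(pairs)                     # random object order per sample
    seq = [bos_idx]
    for p in pairs:
        seq += p
    return seq + [eos_idx]
```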

4. A single image's token sequence is not the end of it. Because the paper frames detection as an NLP-style sentence-generation task, the input token sequences must share a preset maximum length, just like the maximum prompt length in NLP.

>>1) Process per batch: pad every token sequence in the batch up to the length of the longest sequence in that batch.

seq_batch = pad_sequence(
    seq_batch, padding_value=pad_idx, batch_first=True)

For the syntax, see here.

>>2) Then build an extra pad block, filled entirely with 406, and concatenate it with the padded seq_batch, producing token sequences of length max_length.

At this point, the input token sequences are fully prepared.
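The two padding stages can be sketched as follows; `collate_tokens` is a hypothetical helper, with CFG.max_len and CFG.pad_idx written as plain arguments:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_tokens(seq_batch, max_len=300, pad_idx=406):
    """Pad to the longest sequence in the batch, then pad to the fixed max_len."""
    seq_batch = pad_sequence(seq_batch, padding_value=pad_idx, batch_first=True)
    pad = torch.full((seq_batch.size(0), max_len - seq_batch.size(1)),
                     pad_idx, dtype=seq_batch.dtype)
    return torch.cat([seq_batch, pad], dim=1)
```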

Network Inputs and Labels

y_input (4,299)

y_input = y[:, :-1]

tensor([[404,  48, 186,  ..., 406, 406, 406],
        [404,  60,  62,  ..., 406, 406, 406],
        [404, 150,  24,  ..., 406, 406, 406],
        [404, 154, 259,  ..., 406, 406, 406]], device='cuda:0')

y_expected (4,299)

y_expected = y[:, 1:]

tensor([[ 48, 186, 280,  ..., 406, 406, 406],
        [ 60,  62, 336,  ..., 406, 406, 406],
        [150,  24, 225,  ..., 406, 406, 406],
        [154, 259, 381,  ..., 406, 406, 406]], device='cuda:0')

The input is the token sequence with the start (BOS) token; the label is the token sequence with the end (EOS) token. This follows the design of NLP sentence-generation tasks. Thanks to the Transformer's parallelism, training does not have to be serial: all token positions are trained in a single pass.

Loss Computation

 preds (4,299,407)

tensor([[[-3.4124e-01, -2.8577e-01,  1.2335e+00,  ...,  2.0368e+00,
           1.0847e+00, -1.6405e-03],
         [-6.6422e-01, -5.3049e-01,  9.1888e-01,  ...,  2.4217e+00,
           5.7358e-01, -2.9166e-02],
         [-1.0293e+00, -1.0462e+00,  1.2401e+00,  ...,  1.5686e+00,
           1.4248e-01,  5.7978e-01],

         [-4.2677e-01,  5.1037e-01,  1.0064e+00,  ...,  8.6959e-01,
          -8.7426e-01,  7.5206e-01],
         [ 5.4397e-02,  2.1306e-01,  1.3231e+00,  ...,  7.0896e-01,
          -8.9445e-01,  1.6835e-01],
         [ 6.6183e-01,  9.4707e-01,  1.0529e+00,  ...,  8.9192e-02,
          -3.9311e-01,  6.6683e-01]]], device='cuda:0', grad_fn=<AddBackward0>)

input (1196,407) 

tensor([[-3.4124e-01, -2.8577e-01,  1.2335e+00,  ...,  2.0368e+00,
          1.0847e+00, -1.6405e-03],
        [-6.6422e-01, -5.3049e-01,  9.1888e-01,  ...,  2.4217e+00,
          5.7358e-01, -2.9166e-02],
        [-1.0293e+00, -1.0462e+00,  1.2401e+00,  ...,  1.5686e+00,
          1.4248e-01,  5.7978e-01],
        ...,
        [-4.2677e-01,  5.1037e-01,  1.0064e+00,  ...,  8.6959e-01,
         -8.7426e-01,  7.5206e-01],
        [ 5.4397e-02,  2.1306e-01,  1.3231e+00,  ...,  7.0896e-01,
         -8.9445e-01,  1.6835e-01],
        [ 6.6183e-01,  9.4707e-01,  1.0529e+00,  ...,  8.9192e-02,
         -3.9311e-01,  6.6683e-01]], device='cuda:0',
       grad_fn=<ReshapeAliasBackward0>)

target  (1196)

tensor([ 48, 186, 280,  ..., 406, 406, 406], device='cuda:0')

Cross-entropy loss is used:

criterion = nn.CrossEntropyLoss(ignore_index=CFG.pad_idx)

Its formula is

l_n = -\log \frac{\exp(x_{n, y_n})}{\sum_{c} \exp(x_{n, c})}

Two points worth noting:

1. The inputs to this cross-entropy loss have shapes x (N, C) and y (N), and y holds class indices directly, with no need for one-hot encoding. Concretely, the label y supplies the index y_n, which picks out the value x_{n,y_n}, from which the loss l_n is computed.

2. The final loss can be either the mean over the batch or the sum.
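A toy worked example of this loss with ignore_index (small shapes instead of the real (1196, 407) tensors):

```python
import torch
import torch.nn as nn

pad_idx = 406
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# toy shapes: preds (B=2, L=3, V=407), target (2, 3) with padding
preds = torch.randn(2, 3, 407)
target = torch.tensor([[48, 186, pad_idx],
                       [60, pad_idx, pad_idx]])

# flatten to the (N, C) / (N,) shapes CrossEntropyLoss expects;
# positions equal to pad_idx contribute nothing to the mean
loss = criterion(preds.reshape(-1, 407), target.reshape(-1))
```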

Network Architecture

EncoderDecoder(
  (encoder): Encoder(
    (model): VisionTransformer(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 384, kernel_size=(16, 16), stride=(16, 16))
        (norm): Identity()
      )
      (pos_drop): Dropout(p=0.0, inplace=False)
      (blocks): Sequential(
        (0): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (1): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (2): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (3): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (4): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (5): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (6): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (7): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (8): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (9): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (10): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
        (11): Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
      )
      (norm): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (fc_norm): Identity()
      (head): Identity()
    )
    (bottleneck): AdaptiveAvgPool1d(output_size=256)
  )
  (decoder): Decoder(
    (embedding): Embedding(407, 256)
    (decoder_pos_drop): Dropout(p=0.05, inplace=False)
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (1): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (2): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (3): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (4): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (5): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output): Linear(in_features=256, out_features=407, bias=True)
    (encoder_pos_drop): Dropout(p=0.05, inplace=False)
  )
)
