Forward Pass
batch_preds --> tgt --> tgt = cat(tgt, padding) --> tgt_embedding --> tgt_mask, tgt_padding_mask
The image is fed into the Encoder.
1. batch_preds has shape (batch size, 1) and is filled entirely with the start (BOS) token, 404.
2. tgt has shape (batch size, 299): index 0 holds the BOS token 404, and every remaining position holds the padding token 406. Its length, 299, comes from max_length - 1.
3. tgt_embedding is the token embedding, of shape (batch size, length, embedding dim), where length is the sequence length and embedding dim is the embedding dimension.
tgt_mask is the causal mask applied to tgt (standard practice), of shape (seq len, seq len); tgt_padding_mask masks tgt's padding tokens (also standard), of shape (batch_size, seq len).
4. The image passes through the Encoder, whose output has shape (batch_size, num patches, embedding dim), where num patches is the number of patches the image is split into.
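As a minimal sketch of the token-to-embedding conversion in step 3 (the shapes match the walkthrough; the construction is illustrative, not the repo's exact code):

import torch
import torch.nn as nn

vocab_size, embed_dim = 407, 256                    # vocabulary size and embedding dimension used here
embedding = nn.Embedding(vocab_size, embed_dim)

tgt = torch.full((1, 299), 406, dtype=torch.long)   # (batch size, max_length - 1), all padding (406)
tgt[:, 0] = 404                                     # index 0 holds the BOS token
tgt_embedding = embedding(tgt)                      # (1, 299, 256): (batch, length, embedding dim)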
To borrow an analogy from language translation: the input image, as the paper describes it, is a kind of "dialect", and the goal is to generate from it the "language" that expresses the bounding box coordinates and the classes.
1. The output of the Transformer goes into a linear layer, which maps the prediction from (batch size, seq len, embedding dim) ---> (batch size, seq len, vocab size), producing a score distribution over the vocabulary for each position.
2. At each step, only the token vector at index length - 1 along the sequence dimension is taken. Note that this is autoregressive: length grows by 1 on every loop iteration, so the index advances step by step, which is exactly position-by-position prediction.
3. Greedy (maximum-probability) sampling draws the token from the predicted distribution.
4. The predicted token is appended to batch_preds, and the loop repeats.
5. The loop runs max_len times in total, where max_len is the maximum length of the predicted sequence, not of the input sequence; the two are different.
6. When emitting the predicted label tokens, preds is passed through softmax, yielding a confidence for each predicted class.
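A minimal sketch of one iteration of this loop (the re-padding of tgt is shown further below; the model call signature is an assumption):

tgt = batch_preds                                       # current prefix, grows by one token each iteration
length = tgt.size(1)                                    # unpadded prefix length (before re-padding)
preds = model(image, tgt, tgt_mask, tgt_padding_mask)   # (batch, 299, vocab size) logits
logits = preds[:, length - 1, :]                        # only the position currently being predicted
probs = torch.softmax(logits, dim=-1)                   # confidence over the vocabulary
next_token = probs.argmax(dim=-1, keepdim=True)         # greedy (maximum-probability) sampling
batch_preds = torch.cat([batch_preds, next_token], dim=1)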
The decoder output: (1, 299, 256)
---->> The dimension of size 1 is the batch dimension: how many images are processed at once, analogous to processing several sentences at once in NLP.
The dimension of size 299 is the maximum length of the predicted sequence, analogous to the maximum number of words in a predicted NLP sentence. Not all of this length is actually used at prediction time: the maximum is presumably set during training, and at test time the code generates only 101 steps. Since a bbox takes 5 tokens, at most 20 objects can be predicted. The value 101 is configurable.
with torch.no_grad():
    for i in range(max_len):   # here max_len = CFG.generation_steps
Note that in the autoregressive prediction loop, the image input stays the same at every step; what changes is batch_preds, and each update to batch_preds also updates the padding:
length = tgt.size(1)
padding = torch.ones(tgt.size(0), CFG.max_len - length - 1).fill_(CFG.pad_idx).long().to(tgt.device)   # the padding shrinks as length grows
tgt = torch.cat([tgt, padding], dim=1)   # the total length stays fixed at (1, 299)
This in turn updates tgt_padding_mask. tgt_mask, by contrast, never changes: it is the constant lower-triangular matrix shown below.
tgt_mask
tensor([[0., -inf, -inf, ..., -inf, -inf, -inf],
        [0., 0., -inf, ..., -inf, -inf, -inf],
        [0., 0., 0., ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0., ..., 0., -inf, -inf],
        [0., 0., 0., ..., 0., 0., -inf],
        [0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
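Such a mask can be built in one line; a minimal sketch (torch.nn.Transformer.generate_square_subsequent_mask produces the same matrix):

import torch

def make_causal_mask(size):
    # -inf strictly above the diagonal, 0 elsewhere:
    # position i may only attend to positions <= i
    return torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

tgt_mask = make_causal_mask(299)   # (seq len, seq len), constant across decoding steps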
Post-processing the Output
1. First find the index of the first EOS token, then check whether index - 1 is a multiple of 5. The reason: counting the BOS and EOS tokens into the predicted sequence, the objects between them come in groups of five tokens (bbox + class), and since indexing starts at 0, subtracting 1 from the EOS index gives exactly the sequence length with the BOS and EOS stripped, which must be divisible by 5.
2. Then dequantize to obtain the final predictions.
Tokenizer
Predictions are generated autoregressively; the predictions produced by the forward pass are visualized below (figure omitted). The 404 there comes from num_bins + num_classes. After discretization the vocabulary actually contains 407 tokens (ids 0-406), because the start (404), end (405), and padding (406) tokens are added on top of the 404 bin and class values.
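Written out explicitly, the token layout looks like this sketch (num_bins = 384 is inferred from 404 = num_bins + 20 classes, so treat it as an assumption):

num_bins, num_classes = 384, 20        # assumed: 384 coordinate bins + the 20 VOC classes
BOS = num_bins + num_classes           # 404, start token
EOS = BOS + 1                          # 405, end token
PAD = EOS + 1                          # 406, padding token
vocab_size = PAD + 1                   # 407, matches the decoder's output dimension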
With the network's output predictions in hand, they are processed as follows.
1. Find the index of the first end-of-sequence (EOS) token.
2. Check whether index - 1 is a multiple of 5; if it is not, this prediction is discarded and treated as detecting no objects.
3. Strip the extra padding noise.
4. Iterate, taking 5 tokens at a time.
5. The first 4 values carry the box; the 5th carries the class.
6. Subtract num_bins from the discretized class token to recover the final class index.
7. Dequantize the box: box / (num_bins - 1) gives the normalized box coordinates at the output feature scale.
8. Rescale the box back to the input image scale; the box format is (Xmin, Ymin, Xmax, Ymax).
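Putting steps 1-8 together, a minimal decoding sketch (assuming the token constants from the Tokenizer section; the helper name and NumPy types are illustrative):

import numpy as np

def decode_predictions(tokens, img_w, img_h, num_bins=384, eos=405):
    # steps 1-2: locate the first EOS and validate the length
    eos_pos = np.where(tokens == eos)[0]
    if len(eos_pos) == 0 or (eos_pos[0] - 1) % 5 != 0:
        return [], []                                  # treated as "no objects detected"
    body = tokens[1:eos_pos[0]]                        # step 3: drop BOS, EOS and the padding noise
    boxes, labels = [], []
    for i in range(0, len(body), 5):                   # step 4: 5 tokens per object
        box = body[i:i + 4].astype(float)              # step 5: first 4 tokens are the box
        labels.append(int(body[i + 4]) - num_bins)     # step 6: class token minus num_bins
        box = box / (num_bins - 1)                     # step 7: dequantize to normalized coords
        boxes.append([box[0] * img_w, box[1] * img_h,  # step 8: back to input-image scale,
                      box[2] * img_w, box[3] * img_h]) # (Xmin, Ymin, Xmax, Ymax)
    return boxes, labels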
Training Process
Data Processing
df
classes holds the category names, 20 in total:
['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
Sampling the data
sample holds the object annotations for one image; this example contains 8 objects.
Augmentation is then applied, producing transformed.
Next, the input sample is built.
1. Quantize the bboxes: each x coordinate is divided by the image width and scaled by (num_bins - 1) to an integer bin (the inverse of the dequantization above); the y axis is handled the same way, except it is divided by height. The labels are quantized directly by adding num_bins; a quantization sketch follows.
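A minimal sketch of this quantization, the inverse of the dequantization used in post-processing (the rounding mode is an assumption):

import numpy as np

def quantize(bboxes, labels, img_w, img_h, num_bins=384):
    # bboxes: (N, 4) array of (xmin, ymin, xmax, ymax) in pixels; labels: (N,) class indices
    q = bboxes.astype(float)
    q[:, [0, 2]] = q[:, [0, 2]] / img_w * (num_bins - 1)   # x coordinates: divide by width
    q[:, [1, 3]] = q[:, [1, 3]] / img_h * (num_bins - 1)   # y coordinates: divide by height
    return q.round().astype(int), labels + num_bins        # integer bins; labels shifted past the bins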
2. Pair the labels with the quantized bboxes and randomly shuffle their order.
3. Build the token sequence:
# bbox list
[303, 180, 364, 304]
# append the label token
[303, 180, 364, 304, 399]
# prepend the start (BOS) token
[404, 303, 180, 364, 304, 399]
# loop over the remaining objects; the final result looks like
[404, 303, 180, 364, 304, 399, 197, 187, 251, 271, 399, 34, 172, 65, 217, 399, 94, 152, 147, 253, 399, 62, 160, 100, 247, 399, 247, 151, 306, 285, 399, 4, 156, 47, 213, 399, 137, 128, 196, 260, 399]
# finally append the end (EOS) token
[404, 303, 180, 364, 304, 399, 197, 187, 251, 271, 399, 34, 172, 65, 217, 399, 94, 152, 147, 253, 399, 62, 160, 100, 247, 399, 247, 151, 306, 285, 399, 4, 156, 47, 213, 399, 137, 128, 196, 260, 399, 405]
At this point, the token sequence for a single image is complete; a small build sketch follows.
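A sketch of this construction (the token values match the example above; the function name is illustrative):

def build_sequence(q_bboxes, q_labels, bos=404, eos=405):
    # q_bboxes: list of quantized [xmin, ymin, xmax, ymax]; q_labels: quantized class tokens
    seq = [bos]
    for box, label in zip(q_bboxes, q_labels):
        seq.extend(box)      # 4 coordinate tokens
        seq.append(label)    # 1 class token
    seq.append(eos)
    return seq

# build_sequence([[303, 180, 364, 304]], [399]) -> [404, 303, 180, 364, 304, 399, 405]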
4. Finishing a single image's token sequence is not the end of it. Because the approach mimics NLP sentence generation, the input token sequences must honor a preset maximum length, i.e., the maximum sentence length in NLP terms.
>>1) Processing happens per batch: every token sequence in the batch is padded to the length of the longest sequence in that batch
seq_batch = pad_sequence(seq_batch, padding_value=pad_idx, batch_first=True)
(pad_sequence here is torch.nn.utils.rnn.pad_sequence.)
>>2) Then an extra pad block, filled entirely with 406, is created and concatenated with the padded seq_batch, finally producing token sequences of length max_length.
This completes the shaping of the input token sequences; a minimal collate sketch combining the two steps follows.
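A collate sketch under the shapes above (max_len = 300 and pad_idx = 406 follow from the 299 = max_length - 1 relation; the function name is illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(seq_batch, max_len=300, pad_idx=406):
    # seq_batch: list of 1-D LongTensors of varying length
    seqs = pad_sequence(seq_batch, padding_value=pad_idx, batch_first=True)  # pad to longest in batch
    pad = torch.full((seqs.size(0), max_len - seqs.size(1)), pad_idx, dtype=torch.long)
    return torch.cat([seqs, pad], dim=1)                                     # (batch, max_len)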
Network Input and Labels
y_input (4,299)
y_input = y[:, :-1]
tensor([[404, 48, 186, ..., 406, 406, 406],
[404, 60, 62, ..., 406, 406, 406],
[404, 150, 24, ..., 406, 406, 406],
[404, 154, 259, ..., 406, 406, 406]], device='cuda:0')
y_expected (4,299)
y_expected = y[:, 1:]
tensor([[ 48, 186, 280, ..., 406, 406, 406],
[ 60, 62, 336, ..., 406, 406, 406],
[150, 24, 225, ..., 406, 406, 406],
[154, 259, 381, ..., 406, 406, 406]], device='cuda:0')
The input is the token sequence carrying the BOS start token, and the label is the token sequence carrying the EOS end token, following the standard NLP sequence-generation design. Because the Transformer processes positions in parallel, training does not have to be serial: all token positions are trained in a single pass.
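A minimal sketch of one training step under this setup, using the cross-entropy criterion defined in the next section (the model call signature is an assumption):

y_input = y[:, :-1]                                   # (4, 299), starts with BOS
y_expected = y[:, 1:]                                 # (4, 299), shifted left, ends with EOS
preds = model(image, y_input)                         # (4, 299, 407) logits for every position at once
loss = criterion(preds.reshape(-1, preds.size(-1)),   # input:  (1196, 407), 1196 = 4 x 299
                 y_expected.reshape(-1))              # target: (1196,)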
Loss Computation
preds (4,299,407)
tensor([[[-3.4124e-01, -2.8577e-01, 1.2335e+00, ..., 2.0368e+00,
1.0847e+00, -1.6405e-03],
[-6.6422e-01, -5.3049e-01, 9.1888e-01, ..., 2.4217e+00,
5.7358e-01, -2.9166e-02],
[-1.0293e+00, -1.0462e+00, 1.2401e+00, ..., 1.5686e+00,
1.4248e-01, 5.7978e-01],
[-4.2677e-01, 5.1037e-01, 1.0064e+00, ..., 8.6959e-01,
-8.7426e-01, 7.5206e-01],
[ 5.4397e-02, 2.1306e-01, 1.3231e+00, ..., 7.0896e-01,
-8.9445e-01, 1.6835e-01],
[ 6.6183e-01, 9.4707e-01, 1.0529e+00, ..., 8.9192e-02,
-3.9311e-01, 6.6683e-01]]], device='cuda:0', grad_fn=<AddBackward0>)
input (1196, 407), i.e., preds reshaped to (4 × 299, 407)
tensor([[-3.4124e-01, -2.8577e-01,  1.2335e+00, ...,  2.0368e+00,  1.0847e+00, -1.6405e-03],
        [-6.6422e-01, -5.3049e-01,  9.1888e-01, ...,  2.4217e+00,  5.7358e-01, -2.9166e-02],
        [-1.0293e+00, -1.0462e+00,  1.2401e+00, ...,  1.5686e+00,  1.4248e-01,  5.7978e-01],
        ...,
        [-4.2677e-01,  5.1037e-01,  1.0064e+00, ...,  8.6959e-01, -8.7426e-01,  7.5206e-01],
        [ 5.4397e-02,  2.1306e-01,  1.3231e+00, ...,  7.0896e-01, -8.9445e-01,  1.6835e-01],
        [ 6.6183e-01,  9.4707e-01,  1.0529e+00, ...,  8.9192e-02, -3.9311e-01,  6.6683e-01]],
       device='cuda:0', grad_fn=<ReshapeAliasBackward0>)
target (1196)
tensor([ 48, 186, 280, ..., 406, 406, 406], device='cuda:0')
Cross-entropy loss is used:
criterion = nn.CrossEntropyLoss(ignore_index=CFG.pad_idx)
Its formula, for a logit vector $x$ and target class $y$, is
$$\ell(x, y) = -\log\frac{\exp(x_{y})}{\sum_{c=1}^{C}\exp(x_{c})} = -x_{y} + \log\sum_{c=1}^{C}\exp(x_{c})$$
There are two points worth noting here:
1. The inputs to this cross-entropy loss have shapes x: (N, C) and y: (N), and y is given as class indices, with no need for one-hot conversion. Concretely, the label y acts as an index that selects the logit $x_{y}$, from which the loss is computed.
2. The final loss can be either averaged over the batch or summed (the reduction argument).
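A tiny sketch of how ignore_index behaves here (the values are illustrative):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=406)   # padding positions (406) contribute no loss

logits = torch.randn(4, 407)                        # x: (N, C) raw logits
target = torch.tensor([48, 186, 406, 406])          # y: (N) class indices; two padded positions
loss = criterion(logits, target)                    # mean over the two non-padding positions only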
Network Architecture
EncoderDecoder(
(encoder): Encoder(
(model): VisionTransformer(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 384, kernel_size=(16, 16), stride=(16, 16))
(norm): Identity()
)
(pos_drop): Dropout(p=0.0, inplace=False)
(blocks): Sequential(
(0): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(1): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(2): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(3): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(4): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(5): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(6): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(7): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(8): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(9): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(10): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
(11): Block(
(norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(ls1): LayerScale()
(drop_path1): Identity()
(norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
(ls2): LayerScale()
(drop_path2): Identity()
)
)
(norm): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
(fc_norm): Identity()
(head): Identity()
)
(bottleneck): AdaptiveAvgPool1d(output_size=256)
)
(decoder): Decoder(
(embedding): Embedding(407, 256)
(decoder_pos_drop): Dropout(p=0.05, inplace=False)
(decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(output): Linear(in_features=256, out_features=407, bias=True)
(encoder_pos_drop): Dropout(p=0.05, inplace=False)
)
)