Environment Setup and Script Writing
Following the official instructions did not go through cleanly, so the pip entries from the yaml file were installed manually:
conda create -n artrack python=3.9
# Activate the environment and cd into the project root
conda activate artrack
cd ARTrack-main
pip install astor==0.8.1 configparser==5.2.0 data==0.4 docker-pycreds==0.4.0 easydict==1.9 \
    einops==0.4.1 formulaic==0.5.2 funcsigs==1.0.2 future==0.18.2 gitdb==4.0.9 gitpython==3.1.27 \
    interface-meta==1.3.0 iopath==0.1.9 jpeg4py==0.1.4 jsonpatch==1.32 jsonpointer==2.3 latex==0.7.0 \
    libarchive-c==2.9 linearmodels==4.29 lmdb==1.3.0 loguru==0.6.0 mat73==0.59 memory-profiler==0.60.0 \
    msgpack==1.0.2 ninja==1.11.1 opencv-python==4.5.5.64 pathtools==0.1.2 promise==2.3 \
    property-cached==1.6.4 protobuf==3.20.0 pycocotools==2.0.4 pyhdfe==0.1.2 \
    ruamel-yaml-conda==0.15.100 sentry-sdk==1.5.8 setproctitle==1.2.2 setuptools-scm==7.1.0 \
    shapely==1.8.1.post1 shortuuid==1.0.8 shutilwhich==1.1.0 smmap==5.0.0 tables==3.6.1 tempdir==0.7.1 \
    tensorboardx==2.5.1 thop==0.1.0.post2207010342 tikzplotlib==0.10.1 timm==0.5.4 tomli==2.0.1 \
    torch==1.11.0 torchfile==0.1.0 visdom==0.1.8.9 wandb==0.12.11 webcolors==1.12 yaspin==2.1.0
The default paths written by the following script need to be adapted:
python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output
Download the trained model from the official repo, create the following path, and place the model under it:
ARTrack-main/output/checkpoints/train/artrack_seq/artrack_seq_256_full/ARTrackSeq_ep0060.pth.tar
Create a directory for the encoder's pretrained model, put the pretrained model there, point the yaml file at it, and also modify the source script artrack_seq.py:
mkdir pretrained_models
# File name: mae_pretrain_vit_base.pth (a download link is provided in the official repo)
# In artrack_seq_256_full.yaml, rewrite this with an absolute path
PRETRAIN_PTH: "/root/data/zjx/Code-subject/ARTrack/ARTrack-main/pretrained_models"
# Also, around line 100 of artrack_seq.py, change
load_from = cfg.MODEL.PRETRAIN_PTH
# to
load_from = cfg.MODEL.PRETRAIN_PTH + '/' + cfg.MODEL.PRETRAIN_FILE
# Likewise, around line 103 of artrack_seq.py, change
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["net"], strict=False)
# to
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["model"], strict=False)
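This key change matters because ARTrack's own training checkpoints store their weights under "net", whereas the MAE pretrained file stores them under "model". A quick sketch to confirm which key a given checkpoint exposes:

import torch
checkpoint = torch.load(load_from, map_location='cpu')
print(checkpoint.keys())  # MAE pretrain exposes 'model'; ARTrack training checkpoints expose 'net'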
The codebase does not ship a run-video script, so a custom one is needed:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import sys
import argparse
import multiprocessing

# Make the project root importable before pulling in lib.*
prj_path = os.path.join(os.path.dirname(__file__), '..')
if prj_path not in sys.path:
    sys.path.append(prj_path)

import torch
from lib.test.evaluation.tracker import Tracker

torch.set_num_threads(1)

parser = argparse.ArgumentParser(description='Run tracker on a video file.')
parser.add_argument('tracker_name', type=str, help='Name of tracking method.')
parser.add_argument('tracker_param', type=str, help='Name of config file.')
parser.add_argument('--runid', type=int, default=None, help='The run id.')
parser.add_argument('--video_path', type=str, default=None, help='Path to the video file.')
parser.add_argument('--sequence', type=str, default=None, help='Sequence number or name.')
parser.add_argument('--debug', type=int, default=0, help='Debug level.')
parser.add_argument('--threads', type=int, default=0, help='Number of threads.')
parser.add_argument('--num_gpus', type=int, default=8)
args = parser.parse_args()


def main():
    print('[INFO] Loading the model')
    # load config and build the tracker
    tracker = Tracker(args.tracker_name, args.tracker_param, None, args.runid)
    try:
        # when run inside a multiprocessing pool, spread workers across GPUs
        worker_name = multiprocessing.current_process().name
        worker_id = int(worker_name[worker_name.find('-') + 1:]) - 1
        gpu_id = worker_id % args.num_gpus
        torch.cuda.set_device(gpu_id)
    except Exception:
        pass
    # the video is decoded frame by frame into images inside run_video
    tracker.run_video(args.video_path, None, None, None, False)


if __name__ == '__main__':
    main()
Run:
python tracking/run_video.py artrack_seq artrack_seq_256_full --video_path /root/data/zjx/Code-subject/OSTrack-main/experiments/video/soccer1.avi
The inference forward pass
Template-region cropping is the same as in the OSTrack code. At initialization, a buffer self.store_result is created for the bbox coordinates of the N frames to be kept; it is filled entirely with copies of the init bbox, and N is set to 7 here:
for i in range(self.save_all - 1):
    self.store_result.append(info['init_bbox'].copy())
Search-region cropping is also the same as in OSTrack. The earlier frames' coordinates are then transformed into relative coordinates, with the previous frame's prediction as the reference point. Since the current frame's search region is cropped centered on the previous frame's predicted bbox, the center of the search region is exactly the center of that bbox. The previous frame's prediction is at original-image scale while the search region is at crop-size scale, so it suffices to compute, at original-image scale, the offsets of the earlier frames' predictions relative to the previous frame's prediction and multiply them by the resize factor to map them into crop-size scale. After this transform, the previous frame's predicted bbox lands exactly at the center of the search region, i.e. (crop_size/2, crop_size/2).
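A minimal sketch of this mapping, with illustrative names (box_to_crop is not the repo's actual function):

def box_to_crop(box_xywh, prev_box_xywh, resize_factor, crop_size):
    # box centers at original-image scale
    cx = box_xywh[0] + 0.5 * box_xywh[2]
    cy = box_xywh[1] + 0.5 * box_xywh[3]
    pcx = prev_box_xywh[0] + 0.5 * prev_box_xywh[2]
    pcy = prev_box_xywh[1] + 0.5 * prev_box_xywh[3]
    # offset relative to the reference center, scaled into crop resolution,
    # then shifted so the reference center lands at crop_size / 2
    cx_crop = (cx - pcx) * resize_factor + 0.5 * crop_size
    cy_crop = (cy - pcy) * resize_factor + 0.5 * crop_size
    w_crop = box_xywh[2] * resize_factor
    h_crop = box_xywh[3] * resize_factor
    # normalized xywh in the crop; values can fall outside [0, 1] near borders
    return [(cx_crop - 0.5 * w_crop) / crop_size,
            (cy_crop - 0.5 * h_crop) / crop_size,
            w_crop / crop_size,
            h_crop / crop_size]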
After the transform, the values are normalized by the crop size; they may end up below 0 or above 1 because the transform can cross the crop boundary. The boxes are then converted from xywh to xyxy, and the values are restricted to the (-0.5, 1.5) interval. The coordinates are then quantized:
box_out_i = (box_out_i + 0.5) * (self.bins - 1)  # quantize
Adding 0.5 guards against negative values, so the bbox coordinates end up quantized into the range [0, 2*(bins-1)]. The coordinate input carrying the spatio-temporal context is then:
seqs_out = seqs_out.unsqueeze(0)  # (1, 28): bbox info of 7 consecutive frames; at init, repeats of the template frame's box
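A quick round-trip of this quantization (bins = 400 is illustrative; the config value may differ):

import torch

bins = 400                                          # illustrative value
box = torch.tensor([0.12, 0.34, 0.56, 0.78])        # normalized xyxy in (-0.5, 1.5)
tokens = ((box + 0.5) * (bins - 1)).round().long()  # integers in [0, 2*(bins-1)]
recovered = tokens.float() / (bins - 1) - 0.5       # inverse mapping applied after decoding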
The template and search region are fed into the ViT backbone for feature extraction, which downsamples by 16x overall. The extracted patch sequence, the positional encodings, and the transformed bbox information of the earlier frames are then passed into the following Transformer.
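For scale, assuming the 256 config uses a 256x256 search crop and a 128x128 template crop with 16x16 patches (an assumption, not read from the config here):

search_tokens = (256 // 16) ** 2    # 16 * 16 = 256 x-patch tokens
template_tokens = (128 // 16) ** 2  # 8 * 8 = 64 z-patch tokens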
The features first enter an encoder: the FeatureFusionEncoder class does some preprocessing, and its basic building block is the FeatureFusion module. Its layout can be seen in the network-structure dump below; it ultimately returns feature patches with the same shapes as z and x.
Next, the bbox coordinate sequence of the earlier frames and a start token are concatenated as the decoder's input sequence. Because only the four bbox coordinates need to be predicted, no extra end token is required; the output sequence length is simply 4.
1. The input sequence is passed through the word-embedding table; the embedding dimension matches the channel dimension of the feature patches obtained from the downsampled crop (768 here).
2. The initial input tgt, the template features, the search features, the positional encodings of the z patches and the x patches, the identity embedding and a second embedding (both initialized from a truncated normal distribution), the query embedding, and the input-sequence mask are fed into the decoder.
The decoder consists mainly of stacked TargetQueryDecoderLayer modules, 6 layers in total (see the TargetQueryDecoderBlock section of the structure dump below).
The final output is a token sequence with the same shape as tgt, i.e. (1, length, 768), where length is the length of tgt and grows as the sequence is predicted. Then:
1. Take the embedding of the last token of the query output and multiply it with the word-embedding weight matrix, producing a score associated with each quantized position.
2. Apply softmax to obtain a probability distribution over the quantized coordinates.
3. Use argmax sampling, i.e., pick the position with the highest probability.
4. Append the predicted quantized coordinate to tgt and continue the loop.
5. The final result is the quantized coordinates of the predicted bbox; a sketch of the loop follows.
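A minimal sketch of this argmax decoding loop (decoder and embed are stand-ins for the repo's modules, not its actual API):

import torch

def greedy_decode(decoder, embed, memory, start_token, steps=4):
    tgt = torch.tensor([[start_token]])                # (1, 1): the start flag
    for _ in range(steps):                             # one step per coordinate
        hidden = decoder(embed(tgt), memory)           # (1, len, 768)
        logits = hidden[:, -1] @ embed.weight.T        # score every vocabulary bin
        probs = logits.softmax(dim=-1)                 # distribution over the bins
        next_tok = probs.argmax(dim=-1, keepdim=True)  # argmax sampling
        tgt = torch.cat([tgt, next_tok], dim=1)        # feed the prediction back in
    return tgt[:, 1:]                                  # the 4 quantized coordinates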
After the network produces its prediction:
1. Dequantize the bbox coordinates.
2. Convert xyxy to xywh (center point plus width and height).
3. Map back to the original image scale and convert to xywh (top-left corner plus width and height).
4. Clip the result, removing any part of the bbox that falls outside the image.
5. For the stored coordinate history, pop the oldest entry and append the current prediction behind the most recent one (the previous frame's box), a sliding-window update; see the sketch below.
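A hedged sketch of steps 1-5 (illustrative names; prev_cx/prev_cy denote the previous prediction's center in image coordinates, and store is the trajectory buffer):

def postprocess(tokens, bins, resize_factor, crop_size,
                prev_cx, prev_cy, img_w, img_h, store):
    x1, y1, x2, y2 = (tokens.float() / (bins - 1) - 0.5).tolist()  # 1. dequantize
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2                          # 2. xyxy -> cxcywh
    w, h = x2 - x1, y2 - y1
    side = crop_size / resize_factor         # crop extent measured in image pixels
    cx_img = prev_cx + (cx - 0.5) * side     # 3. back to original-image scale
    cy_img = prev_cy + (cy - 0.5) * side
    w_img, h_img = w * side, h * side
    x = min(max(cx_img - w_img / 2, 0), img_w)           # 4. clip to image bounds
    y = min(max(cy_img - h_img / 2, 0), img_h)
    pred = [x, y, min(w_img, img_w - x), min(h_img, img_h - y)]
    store.pop(0)                             # 5. sliding-window trajectory update
    store.append(pred)
    return pred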
The training forward pass
Since we only want to step through the forward pass, replace the path of the single-GPU training script in train.py with an absolute path:
train_cmd = "python /root/data/zjx/Code-subject/ARTrack/ARTrack-main/lib/train/run_training.py --script
In artrack/config.py, replace the pretrained path with an absolute path:
cfg.MODEL.PRETRAIN_PTH = "/root/data/zjx/Code-subject/ARTrack/ARTrack-main/pretrained_models"
Make the following changes in lib/models/artrack/artrack.py:
# Change
load_from = cfg.MODEL.PRETRAIN_PTH
# to
load_from = cfg.MODEL.PRETRAIN_PTH + '/' + cfg.MODEL.PRETRAIN_FILE
# and change
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["net"], strict=False)
# to
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["model"], strict=False)
Run in the terminal:
python tracking/train.py --script artrack --config artrack_256_full --save_dir ./output --mode single
When configuring this locally, do not set the arguments in train.py; set them directly in the run_training file:
--script artrack --config artrack_256_full --save_dir ./output --use_lmdb 0 --script_prv None --config_prv baseline --distill 0 --script_teacher None --config_teacher None --use_wandb 0
The data processing pipeline
As the paper notes, in the artrack case the data is still processed per frame.
A video id is randomly sampled from the video dataset, and the annotations of its entire sequence are fetched into seq_info_dict. Inspecting the resulting data shows that it contains the search region of only a single frame.
In the artrack_seq case, however, this is not so.
Run:
python tracking/train.py --script artrack_seq --config artrack_seq_256_full --save_dir ./output --mode single
The sampling procedure
1. Randomly select one video sequence and fetch its ground-truth information, including bbox, visible flags, etc.
2. Randomly select one frame in [0, len(visible) - num_search_frames] as the template.
3. Compute the selectable range for the search frames: under the constraint that the template precedes the search frames, template_frame + max_gap serves as the latest allowed start of the search range, i.e., template and search may be at most max_gap frames apart; call this left_max. Then sample a frame id uniformly from [template_frame, left_max] as left.
4. Starting from left, take num_search_frames consecutive frames as the search frames (36 in the code). If fewer than num_search_frames consecutive frames are available from left onward, the last frame is duplicated until the required count is reached.
5. After the above, the template frame id and the search frame ids have been sampled; on the time axis the template precedes the search frames, and the search frames are consecutive.
6. Passing the video id, the selected frame ids, and the ground-truth dict to the dataset utility API returns the images (img), the annotations (anno), and the meta attributes (meta_obj).
7. Each group obtained this way is stored as a dict, forming one training sample. A sketch of steps 1-4 follows.
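A minimal sketch of the frame-id sampling in steps 1-4 (hypothetical names and max_gap value; the repo's sampler differs in detail):

import random

def sample_frame_ids(num_frames, num_search_frames=36, max_gap=300):
    template_id = random.randint(0, num_frames - num_search_frames - 1)
    left_max = min(template_id + max_gap, num_frames - 1)
    left = random.randint(template_id, left_max)
    search_ids = list(range(left, min(left + num_search_frames, num_frames)))
    while len(search_ids) < num_search_frames:  # pad by repeating the last frame
        search_ids.append(search_ids[-1])
    return [template_id], search_ids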
Processing when the model reads the data
1. Loop over the num_frames consecutive search images that were sampled, for num_frames iterations in total.
2. On the first iteration, the template information is initialized, including processing of the template image: based on template img and template bbox, the original template image is cropped and resized to the required input size. The retained trajectory bboxes are also initialized, all as copies of the very first search frame's box, replicated N times, where N is the number of search-region frames used during tracking.
3. From the second iteration onward, the search images in the sample are processed: each search-region image is cropped and resized to the required size.
4. seq_info_dict (screenshot omitted)
Example of the selected template and search frame ids:
# template frame ids
[76]
# search frame ids
[109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144]
search_anno (the corresponding annotations for these search frames; screenshot omitted)
Network architecture
ARTrackSeq(
(backbone): VisionTransformer(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
(norm): Identity()
)
(pos_drop): Dropout(p=0.0, inplace=False)
(blocks): Sequential(
(0): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): Identity()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(1): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(2): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(3): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(4): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(5): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(6): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(7): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(8): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(9): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(10): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
(11): Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(drop1): Dropout(p=0.0, inplace=False)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop2): Dropout(p=0.0, inplace=False)
)
)
)
(norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
)
(pix_head): Pix2Track(
(word_embeddings): Embedding(802, 768, padding_idx=800, max_norm=1)
(position_embeddings): Embedding(5, 768)
(prev_position_embeddings): Embedding(28, 768)
(encoder): FeatureFusionEncoder(
(layers): ModuleList(
(0): FeatureFusion(
(z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_x_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_z_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(x_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(1): FeatureFusion(
(z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_x_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_z_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(x_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(2): FeatureFusion(
(z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_self_attn): SelfAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_x_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(x_z_cross_attention): CrossAttention(
(q): Linear(in_features=768, out_features=768, bias=True)
(kv): Linear(in_features=768, out_features=1536, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.1, inplace=False)
)
(z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(z_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(x_mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
)
(z_pos_enc): Untied2DPositionalEncoder(
(pos): Learned2DPositionalEncoder()
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
(pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
)
(x_pos_enc): Untied2DPositionalEncoder(
(pos): Learned2DPositionalEncoder()
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
(pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
)
(z_rel_pos_bias_table): RelativePosition2DEncoder()
(x_rel_pos_bias_table): RelativePosition2DEncoder()
(z_x_rel_pos_bias_table): RelativePosition2DEncoder()
(x_z_rel_pos_bias_table): RelativePosition2DEncoder()
)
(decoder): TargetQueryDecoderBlock(
(layers): ModuleList(
(0): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(1): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(2): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(3): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(4): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
(5): TargetQueryDecoderLayer(
(norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(self_attn1): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlpz): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU()
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.1, inplace=False)
)
(drop_path): Identity()
)
)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
)