StrongSORT: Make DeepSORT Great Again
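Abstract
MOT methods fall roughly into tracking-by-detection and joint-detection-association algorithms. In terms of tracking accuracy, tracking-by-detection (detect first, then associate detection boxes into trajectories using similarity cues such as position, appearance, and motion) remains the best-performing approach.
This paper revisits DeepSORT and improves it in terms of detection, embedding, and association. The resulting tracker, StrongSORT, sets new HOTA and IDF1 records on MOT17 and MOT20.
The paper also introduces two lightweight algorithms that further refine the tracking results: 1. an appearance-free link model (AFLink) that associates short tracklets into complete trajectories; 2. Gaussian-smoothed interpolation (GSI) that compensates for missing detections. Adding both to StrongSORT gives the final tracker, **StrongSORT++**, which achieves the highest HOTA and IDF1 on MOT17 and MOT20.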
2.Related Works
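2.1 Separate and Joint Trackers
MOT methods can be divided into separate trackers [1][2] and joint trackers [3][4][5]. Joint trackers train detection together with other components such as motion, embedding, and association models; their main advantages are low computational cost and competitive performance. However, joint trackers face two main problems: 1. competition between the different components; 2. limited data for training the components jointly, which caps the achievable tracking accuracy. Tracking-by-detection therefore remains the best approach in terms of accuracy.
Meanwhile, several recent studies [6][7] have abandoned appearance information and rely only on high-performance detectors and motion information, achieving high running speed and state-of-the-art performance on the MOTChallenge benchmarks [8][9]. In more complex scenes, however, discarding appearance features reduces robustness. This paper revisits the DeepSORT-like [10] architecture and equips it with more advanced components.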
2.2 Global Link in MOT
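To exploit global information, a global link model is used to refine the tracking results. Such methods typically generate accurate but incomplete tracklets from spatio-temporal or appearance information, then link these tracklets offline by mining global information. This paper proposes AFLink, which predicts the link between two tracklets using motion information only. For MOT, it is the first lightweight global link model that does not use appearance information.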
3.StrongSORT
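3.1 Review of DeepSORT [10]
DeepSORT has a two-branch structure: an appearance branch and a motion branch.
In the appearance branch, a deep appearance descriptor (a CNN) pretrained on the MARS dataset extracts appearance features for the detections in each frame. A feature bank mechanism stores the appearance features of each trajectory over the past 100 frames. When a new detection arrives, the minimum cosine distance between the feature bank $R$ of the $i$-th trajectory and the feature $f_j$ of the $j$-th detection is computed as follows, and this distance serves as the matching cost in the association step:
$$d(i,j)=\min\{1-f_j^T f_k^{(i)} \mid f_k^{(i)}\in R\}$$
In the motion branch, a Kalman filter predicts the position of each trajectory in the current frame, and the Mahalanobis distance then measures the spatio-temporal dissimilarity between trajectories and detections. DeepSORT uses this motion distance as a gate to filter out unlikely associations.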
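As a rough illustration of the two branches, the sketch below computes the minimum cosine distance against one trajectory's feature bank and a squared Mahalanobis distance used for gating. The function names, the toy inputs, and the gate value (the 95% chi-square quantile for a 4-dimensional measurement, as commonly used with DeepSORT) are assumptions of this sketch and are not taken from the notes above.

```python
import numpy as np

def min_cosine_distance(bank, f_j):
    """d(i, j) = min_k (1 - f_j^T f_k^(i)) over one trajectory's feature bank."""
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)  # unit-normalize stored features
    f_j = f_j / np.linalg.norm(f_j)                            # unit-normalize detection feature
    return float(np.min(1.0 - bank @ f_j))

def mahalanobis_sq(z, z_pred, S):
    """Squared Mahalanobis distance between a detection z and the
    Kalman-predicted measurement z_pred with innovation covariance S."""
    diff = z - z_pred
    return float(diff @ np.linalg.solve(S, diff))

# Assumed gate: 95% chi-square quantile for 4 degrees of freedom.
GATE_4DOF = 9.4877

# Appearance branch: a toy feature bank with two stored features.
bank_i = np.array([[1.0, 0.0], [0.0, 1.0]])
f_j = np.array([1.0, 1.0])
d_app = min_cosine_distance(bank_i, f_j)

# Motion branch: a toy 4-D measurement, e.g. (x, y, aspect ratio, height).
z = np.array([2.0, 0.0, 0.0, 0.0])
z_pred = np.zeros(4)
S = np.diag([4.0, 1.0, 1.0, 1.0])
d_motion = mahalanobis_sq(z, z_pred, S)
admissible = d_motion <= GATE_4DOF  # pairs beyond the gate are not associated

print(d_app, d_motion, admissible)
```

In this toy case the detection passes the gate, so the appearance distance would be kept as its matching cost.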
3.2 Stronger DeepSORT
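The improvements over DeepSORT are as follows.
1. In the appearance branch, BoT + ResNeSt50 replaces the CNN for extracting appearance features, and an exponential moving average (EMA) replaces the feature bank for updating the appearance state.
A stronger appearance feature extractor, BoT, is applied to replace the original simple CNN. By taking ResNeSt50 as the backbone and pretraining on the DukeMTMC-reID dataset, it can extract much more discriminative features.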
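2. In the motion branch, ECC is used for camera motion compensation. In addition, the vanilla Kalman filter is vulnerable to low-quality detections and ignores information about the scale of the detection noise, so StrongSORT borrows the NSA Kalman algorithm [11], which computes the noise covariance $\tilde{R}_k$ adaptively:
$$\tilde{R}_k=(1-c_k)R_k$$
where $R_k$ is the preset constant measurement noise covariance and $c_k$ is the detection confidence score at step $k$.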
3. The assignment problem is solved with both appearance and motion information rather than with appearance alone. The cost matrix $C$ is a weighted sum of the appearance cost $A_a$ and the motion cost $A_m$, with $\lambda=0.98$:
$$C=\lambda A_a+(1-\lambda)A_m$$
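The three changes listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the EMA form $e_t=\alpha e_{t-1}+(1-\alpha)f_t$ with $\alpha=0.9$, the re-normalization of the smoothed feature, the helper names, and the toy numbers are not given in these notes and should be treated as hypothetical.

```python
import numpy as np

def ema_update(e_prev, f_new, alpha=0.9):
    """Item 1: EMA appearance state, e_t = alpha * e_{t-1} + (1 - alpha) * f_t.
    alpha = 0.9 and the re-normalization are assumptions of this sketch."""
    e = alpha * e_prev + (1.0 - alpha) * f_new
    return e / np.linalg.norm(e)

def nsa_noise(R_k, c_k):
    """Item 2: adaptive measurement noise, R~_k = (1 - c_k) * R_k."""
    return (1.0 - c_k) * R_k

def kalman_update(x, P, z, H, R):
    """Standard Kalman measurement update; R may come from nsa_noise."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    return x + K @ (z - H @ x), (np.eye(len(x)) - K @ H) @ P

def combined_cost(A_a, A_m, lam=0.98):
    """Item 3: C = lam * A_a + (1 - lam) * A_m."""
    return lam * A_a + (1.0 - lam) * A_m

# Item 1: the smoothed feature moves only slightly toward the new feature.
e = ema_update(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Item 2: a 1-D toy filter. A confident detection (c_k = 0.9) gets a smaller
# noise covariance, so the state moves further toward the measurement.
x0, P0, H, R_k = np.array([0.0]), np.eye(1), np.eye(1), np.eye(1)
z = np.array([1.0])
x_conf, _ = kalman_update(x0, P0, z, H, nsa_noise(R_k, 0.9))
x_weak, _ = kalman_update(x0, P0, z, H, nsa_noise(R_k, 0.1))

# Item 3: toy appearance and motion costs for 2 tracks x 2 detections.
A_a = np.array([[0.1, 0.8], [0.7, 0.2]])
A_m = np.array([[5.0, 1.0], [1.0, 5.0]])
C = combined_cost(A_a, A_m)

print(e, x_conf, x_weak)
print(C)
```

With $\lambda=0.98$ the appearance cost dominates $C$; in this toy example the motion term shifts each entry by at most 0.1. The resulting matrix would then be handed to an assignment solver.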
4.StrongSORT++
4.1 AFLink
AFLink predicts the link between two tracklets from spatio-temporal information only, without appearance features.
A temporal module is applied to extract features by convolving along the temporal dimension with 7 × 1 kernels. Then, a fusion module performs 1 × 3 convolutions to integrate the information from different feature dimensions, namely f, x and y. The two resulting feature maps are pooled and squeezed to feature vectors respectively, and then concatenated, which includes rich spatio-temporal information. Finally, an MLP is used to predict a confidence score for association.
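To make the data flow of the description above concrete, here is a shape-level NumPy sketch with random, untrained weights. It only follows the order of operations (7 × 1 temporal convolution, 1 × 3 fusion convolution, pooling, concatenation, MLP). The tracklet length of 30 frames, the channel count, the MLP width, the single convolution layer per module, and the sharing of weights between the two tracklets are all simplifying assumptions of this sketch, not details confirmed by these notes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
T, D, C, HID = 30, 3, 8, 16   # frames, (f, x, y), channels, MLP width: illustrative values

W_t = rng.normal(size=(C, 7)) * 0.1        # temporal module: 7 x 1 kernels
W_f = rng.normal(size=(C, C, D)) * 0.1     # fusion module: 1 x 3 kernels
W_1 = rng.normal(size=(HID, 2 * C)) * 0.1  # MLP hidden layer
W_2 = rng.normal(size=(2, HID)) * 0.1      # MLP output layer (no link / link)

def encode(track):
    """Map one tracklet of shape (T, 3) with columns (f, x, y) to a (C,) vector."""
    windows = sliding_window_view(track, 7, axis=0)                # (T-6, 3, 7)
    h = np.maximum(np.einsum("tdk,ck->ctd", windows, W_t), 0.0)    # temporal conv + ReLU: (C, T-6, 3)
    g = np.maximum(np.einsum("ctd,ocd->ot", h, W_f), 0.0)          # fuse f, x, y: (C, T-6)
    return g.mean(axis=1)                                          # pool over time: (C,)

def link_confidence(track_a, track_b):
    """Concatenate both tracklet vectors and score the association with an MLP."""
    v = np.concatenate([encode(track_a), encode(track_b)])         # (2C,)
    logits = W_2 @ np.maximum(W_1 @ v, 0.0)
    p = np.exp(logits - logits.max())                              # numerically stable softmax
    return float(p[1] / p.sum())

# Two toy tracklets on the same straight-line motion, separated by a short gap.
t_a = np.arange(T, dtype=float)
track_a = np.stack([t_a, 100 + 2 * t_a, 50 + 0 * t_a], axis=1)
t_b = t_a + 35
track_b = np.stack([t_b, 100 + 2 * t_b, 50 + 0 * t_b], axis=1)

score = link_confidence(track_a, track_b)
print(encode(track_a).shape, score)
```

Because the weights are random, the score itself is meaningless here; the point is that the whole model consumes only frame indices and positions, which is why it can stay lightweight.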
4.2 GSI
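Interpolation fills the trajectory gaps caused by missing detections. Linear interpolation is widely used for its simplicity, but its accuracy is limited because it ignores motion information. This paper proposes a lightweight interpolation based on Gaussian process regression to model nonlinear motion.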
The GSI model for the $i$-th trajectory is:
$$p_t=f^{(i)}(t)+\epsilon$$
where $t\in F$ is the frame, $p_t\in P$ is the position coordinate at frame $t$, and $\epsilon\sim N(0,\sigma^2)$ is Gaussian noise. Nonlinear motion is modeled by fitting the function $f^{(i)}$, assuming $f^{(i)}\in GP(0,k(\cdot,\cdot))$, where $k(x,x')=\exp\left(-\frac{\|x-x'\|^2}{2\lambda^2}\right)$. Based on the Gaussian process, given a new set of frames $F^*$, the smoothed positions $P^*$ are predicted as:
$$P^*=K(F^*,F)\left(K(F,F)+\sigma^2 I\right)^{-1}P$$
where $K$ is the covariance function based on $k(\cdot,\cdot)$. The hyperparameter $\lambda$ controls the smoothness of the trajectory and is designed as a function adaptive to the trajectory length $l$, with $\tau=10$:
$$\lambda=\tau\cdot\log(\tau^3/l)$$
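The prediction formula and the adaptive length scale can be tried out directly with NumPy. The following is a minimal sketch, not the reference implementation: the noise level `sigma`, the clipping that keeps $\lambda$ positive for very long trajectories, subtracting the mean position before applying the zero-mean prior, and using the number of observed frames as $l$ are all assumptions made here for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lam):
    """K[m, n] = exp(-(a_m - b_n)^2 / (2 * lam^2)) for 1-D frame indices."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * lam ** 2))

def gsi_smooth(frames, positions, new_frames, tau=10.0, sigma=1.0):
    """Predict P* = K(F*, F) (K(F, F) + sigma^2 I)^-1 P at new_frames."""
    l = len(frames)                                   # assumption: l = number of observed frames
    lam = tau * np.log(tau ** 3 / l)                  # adaptive smoothness
    lam = np.clip(lam, 1.0 / tau, tau ** 2)           # assumption: keep lam positive and bounded
    K_ff = rbf_kernel(frames, frames, lam)
    K_sf = rbf_kernel(new_frames, frames, lam)
    mean = positions.mean(axis=0)                     # assumption: center before the zero-mean GP
    weights = np.linalg.solve(K_ff + sigma ** 2 * np.eye(l), positions - mean)
    return K_sf @ weights + mean

# Toy trajectory: 100 frames of straight-line motion with frames 40..49 missing.
frames = np.array([t for t in range(100) if not 40 <= t < 50], dtype=float)
positions = np.stack([100.0 + 2.0 * frames, np.full_like(frames, 50.0)], axis=1)  # (x, y)
gap = np.arange(40, 50, dtype=float)

filled = gsi_smooth(frames, positions, gap)
true_x = 100.0 + 2.0 * gap
print(np.abs(filled[:, 0] - true_x).max())
```

For short trajectories $\lambda$ is large and the fitted curve is very smooth; as $l$ grows, $\lambda$ shrinks and the curve can follow more local detail.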
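Appendix
[1] SORT: addresses the efficient association of objects to trajectories for online, real-time applications. Detection quality turns out to be the decisive factor for tracking performance. Contributions: 1. it leverages the power of CNN-based detectors in MOT, where changing the detector can improve tracking by up to 18.9%; 2. a Kalman filter handles motion prediction and the Hungarian method handles data association; 3. thanks to the simplicity of the method, the tracker updates at 260 Hz, 20 times faster than other state-of-the-art trackers. [1]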
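[2] IOU Tracker: assumes that every object is detected in every frame, meaning there are almost no gaps between detections. It further assumes that, at a sufficiently high frame rate, detections of the same object in consecutive frames have a high overlap (IoU): $IOU(a,b)=\frac{Area(a)\cap Area(b)}{Area(a)\cup Area(b)}$. Each detection is associated with the track whose last detection in the previous frame has the highest IoU with it, provided that the IoU exceeds $\sigma_{iou}$. In addition, all tracks shorter than $t_{min}$, or without any matched detection scoring above $\sigma_h$, are filtered out. The pseudocode (not reproduced here) uses the following notation: F is the number of frames, T_a = active tracks, T_f = finished tracks, and D_f = detections at frame f. [2]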
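The association rule described in this note can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the box format `(x1, y1, x2, y2)`, the track dictionary layout, and the default threshold values are assumptions chosen for the example.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def iou_tracker(frames, sigma_iou=0.5, sigma_h=0.8, t_min=2):
    """frames: one list of (box, score) detections per frame. Returns finished tracks."""
    active, finished = [], []
    for dets in frames:
        dets = list(dets)
        still_active = []
        for track in active:
            # Best remaining detection for this track, by IoU with its last box.
            best = max(dets, key=lambda d: iou(track["boxes"][-1], d[0]), default=None)
            if best is not None and iou(track["boxes"][-1], best[0]) >= sigma_iou:
                track["boxes"].append(best[0])
                track["max_score"] = max(track["max_score"], best[1])
                dets.remove(best)
                still_active.append(track)
            elif track["max_score"] >= sigma_h and len(track["boxes"]) >= t_min:
                finished.append(track)   # track ended; keep it only if it passes both filters
        # Every unmatched detection starts a new track.
        still_active += [{"boxes": [box], "max_score": score} for box, score in dets]
        active = still_active
    finished += [t for t in active if t["max_score"] >= sigma_h and len(t["boxes"]) >= t_min]
    return finished

# One confident object moving right, plus one low-confidence object that disappears.
frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.4)],
    [((1, 0, 11, 10), 0.85), ((51, 50, 61, 60), 0.5)],
    [((2, 0, 12, 10), 0.9)],
]
tracks = iou_tracker(frames)
print(len(tracks), len(tracks[0]["boxes"]))  # prints "1 3"
```

The low-confidence object is tracked for two frames but discarded at the end because none of its detections reaches `sigma_h`.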
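[5] JDE: designs the prediction head of an object detector so that a one-shot model performs object detection and ID embedding extraction simultaneously. Given a frame x, a feature extractor first processes it to obtain multi-resolution features F, which are fed into the head network (comprising an object detection branch and a ReID branch) to predict the detection results D and the raw ID embedding map E. NMS is applied to the detection results to produce candidate boxes, and for each candidate box the corresponding ID embedding is extracted from E and used for association with the existing trajectories. In JDE, the shared features F are fed directly into the task-specific branches, which produce their outputs with plain 1x1 convolution layers. This causes ambiguity during training… [5]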
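[3] Building on JDE [5], this work proposes CSTrack, a one-shot online MOT system composed of two networks: REN + SAAN = CSTrack. To better balance object detection and ReID and to ease their competition during training, the paper proposes a novel reciprocal network (REN) that decouples the features F before they are fed into the different task branches, then exchanges semantic information between tasks through a cross-relation layer. For the ReID branch, a scale-aware attention network (SAAN) fuses features from different resolutions.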
We propose a novel reciprocal network (REN) with a self-relation and cross-relation design so that to impel each branch to better learn task-dependent representations. Furthermore, we introduce a scale-aware attention network (SAAN) that prevents semantic level misalignment to improve the association capability of ID embeddings.
Our tracker achieves the state-of-the-art performance on MOT16, MOT17 and MOT20 datasets, without other bells and whistles. Moreover, CSTrack is efficient and runs at 16.4 FPS on a single modern GPU, and its lightweight version even runs at 34.6 FPS.
[4]: We propose a conceptually simple and efficient joint model of detection and tracking, called RetinaTrack, which modifies the popular single stage RetinaNet approach such that it is amenable to instance-level embedding training. RetinaTrack allows for extracting instance level features; use these features for tracking and other purposes.
We establish initial strong baselines for detection and tracking from 2d images on the Waymo Open dataset and achieve good performance. [4]
[6]: Daniel Stadler… et al. argue that combining crowd-specific detectors with a simple tracking pipeline can achieve promising results, especially in challenging scenes with heavy occlusion. The trackers use only IoU as the association measure.
We evaluate the applicability of three crowd-specific detection models (CrowdDet, IterDet, and NOH-NMS) for MPT (Multi-Pedestrian Tracking) under two different training scenarios – cross-dataset and fine-tuning performance. [6]
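[7]: In crowded scenes, ambiguous situations with similar track-detection distances occur, which leads to wrong assignments. To mitigate this problem, we propose a new association method that separately treats such difficult situations by modelling ambiguous assignments based on the differences in the distance matrix. 1. Candidate track-detection pairs with similar distances are identified, which models the ambiguous assignments. 2. A different strategy is applied depending on the numbers of tracks and detections involved in an ambiguous assignment: when tracks > detections, the detection is simply deleted; when detections > tracks, an initialization strategy requires a target to be detected in several consecutive frames before a new track is started, which suppresses duplicate detections. [7]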
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos and B. Upcroft, “Simple online and realtime tracking,” 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464-3468, doi: 10.1109/ICIP.2016.7533003.
[2] E. Bochinski, V. Eiselein and T. Sikora, “High-Speed tracking-by-detection without using image information,” 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 1-6, doi: 10.1109/AVSS.2017.8078516.
[3] C. Liang, Z. Zhang, Y. Lu, X. Zhou, B. Li, X. Ye and J. Zou, “Rethinking the competition between detection and reid in multi-object tracking,” arXiv preprint arXiv:2010.12138, 2020.
[4] Z. Lu, V. Rathod, R. Votel and J. Huang, “RetinaTrack: Online Single Stage Joint Detection and Tracking,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14656-14666, doi: 10.1109/CVPR42600.2020.01468.
[5] Z. Wang, L. Zheng, Y. Liu, Y. Li and S. Wang, “Towards real-time multi-object tracking,” Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pp. 107–122, Springer, 2020.
[6] D. Stadler and J. Beyerer, “On the Performance of Crowd-Specific Detectors in Multi-Pedestrian Tracking,” 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2021, pp. 1-12, doi: 10.1109/AVSS52988.2021.9663836.
[7] D. Stadler and J. Beyerer, “Modelling Ambiguous Assignments for Multi-Person Tracking in Crowds,” 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2022, pp. 133-142, doi: 10.1109/WACVW54805.2022.00019.
[8] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler and L. Leal-Taixé, “MOT20: A benchmark for multi object tracking in crowded scenes,” arXiv preprint arXiv:2003.09003, 2020.
[9] A. Milan, L. Leal-Taixé, I. Reid, S. Roth and K. Schindler, “MOT16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
[10] N. Wojke, A. Bewley and D. Paulus, “Simple online and realtime tracking with a deep association metric,” 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645-3649, doi: 10.1109/ICIP.2017.8296962.
[11] Y. Du, J. Wan, Y. Zhao, B. Zhang, Z. Tong and J. Dong, “GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021,” 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021, pp. 2809-2819, doi: 10.1109/ICCVW54120.2021.00315.