MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors (paper study notes)

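Article link: https://blog.csdn.net/Dan_Ao/article/details/132670667

This article introduces MOTRv2, an end-to-end multi-object tracking framework bootstrapped by a pretrained object detector. It builds on the DETR-based MOTR and feeds YOLOX predictions in as proposal queries, which markedly improves detection and association performance. The notes also cover DETR's efficiency and small-object detection problems, and the experiments on the DanceTrack, MOT17 and BDD100K datasets.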

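Official repository: https://github.com/megvii-research/MOTRv2
Paper download: https://arxiv.org/pdf/2211.09791

Prerequisites

MOTR: a fully end-to-end framework for MOT.
Summary of multi-object tracking evaluation metrics in deep-learning computer vision:
Attention mechanism: makes the network attend to the places that deserve attention (spatial attention, channel attention). [https://www.bilibili.com/video/BV1rL4y1n7p3/?p=2&spm_id_from=pageDriver&vd_source=be08cd9cc4a3d6f3fec83590352fca21]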

DETR

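YOLOX is not end-to-end, because it relies on post-processing (NMS). Note that YOLOX itself is anchor-free; it is the earlier YOLO versions that depend on predefined anchors.
DETR needs neither anchors nor NMS.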
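
To make the NMS post-processing step concrete, here is a minimal pure-Python sketch of greedy non-maximum suppression. It is an illustration only, not YOLOX's actual implementation; the `(x1, y1, x2, y2)` box format, the function names and the `0.5` threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object plus one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed. DETR avoids this step because its set-based prediction is trained to output one box per object.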

Deformable DETR

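Background for this section:

Deformable convolution

Idea: a standard convolution has a fixed receptive field, whereas a deformable convolution can change its receptive field, so it adapts better to the object being observed.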
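
The idea can be sketched for a single output location: each of the 3x3 kernel taps is shifted by its own offset, and the feature map is read at the resulting fractional position by bilinear interpolation. This is a hedged, pure-Python illustration; in a real deformable convolution the offsets are predicted by an extra convolution layer, while here they are passed in by hand, and all names are hypothetical.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D list `feat` at fractional (y, x).
    Positions outside the map contribute 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


def deformable_conv_at(feat, kernel, cy, cx, offsets):
    """Output of a 3x3 deformable convolution at one location (cy, cx).

    offsets[k] = (dy, dx) shifts the k-th kernel tap. With all-zero offsets
    this reduces to a standard 3x3 convolution (cross-correlation)."""
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[ky + 1][kx + 1] * bilinear(feat, cy + ky + dy, cx + kx + dx)
            k += 1
    return out


# Toy feature map whose value is 5 * row + col, and an all-ones kernel.
feat = [[float(5 * r + c) for c in range(5)] for r in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
fixed = deformable_conv_at(feat, ones, 2, 2, [(0.0, 0.0)] * 9)
shifted = deformable_conv_at(feat, ones, 2, 2, [(0.0, 0.5)] * 9)
print(fixed, shifted)  # 108.0 112.5
```

With zero offsets the kernel reads the usual fixed 3x3 grid; shifting every tap half a pixel to the right moves the receptive field and changes the result accordingly.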

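Research background

DETR has two drawbacks: it is slow (in particular slow to converge, a limitation of how Transformer attention processes image feature maps) and it struggles with small objects.

Combining deformable convolution with the Transformer is the key to solving these DETR problems.
Deformable attention module
Feature-space alignment problem
reference point: a coordinate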

MOTR

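Trajectory prediction
MOTR extends and improves DETR: it formulates MOT as a set of sequence prediction problems and uses track queries to carry out the sequence prediction.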

Transformer

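[Transformer model] Learn it easily through animations, with vivid and memorable analogies
What does a Transformer do? It extracts features more holistically and takes more global context into account.

Reading the paper

Summary

Q1 What problem does the paper try to solve?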

One major limitation of the end-to-end multiple-object tracking frameworks is their poor detection performance, compared to tracking-by-detection approaches [6, 44] that rely on standalone object detectors.
Because MOTR learns the detection and association tasks jointly, its detection performance is comparatively poor.
Easing the conflict between the detection and association tasks in the shared transformer decoder.

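Q2 Is this a new problem?

It is not a new problem; the paper is an improvement on MOTR.

Q3 What scientific hypothesis does the paper set out to verify?

That adding YOLOX as an auxiliary detector makes MOTR perform better.

Q4 What related work exists? How is it categorized? Which researchers in this area are worth following?

There is a lot of related work, for example the methods listed in the paper's comparison tables.
Category -> multi-object tracking
Researchers worth following:

Q5 What is the key to the solution proposed in the paper? (The ablation study is the evidence; the Discussion has a summary)

Ablation: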

Using YOLOX predictions as proposal queries consistently improves all three metrics (HOTA, DetA, and AssA) regardless of whether the CrowdHuman dataset is used.
Using the anchor boxes instead of points is not only critical for introducing YOLOX detection results but also sufficient for providing the MOTR decoder with localization information. (Conclusion: width and height information is critical.)
The confidence score provides important information for MOTR.
Using sine-cosine encoding for the confidence score works better for the association.
Aligning anchors with the corresponding YOLOX proposals mitigates the accumulation of localization errors during anchor propagation, thereby improving both detection and association accuracy.

Summary:

YOLOX generates high-quality object proposals that help MOTR detect new objects more easily.
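
The ablation points about box anchors, confidence scores and sine-cosine encoding can be tied together in a small sketch. It is written under stated assumptions and is not the official MOTRv2 code: the function names, the 8-dimensional embedding, the `temperature` value and the idea of adding the score encoding to one shared query vector are illustrative choices, and a real model would use learned, much larger embeddings.

```python
import math


def sincos_encode(score, dim=8, temperature=10000.0):
    """Sine-cosine encoding of a scalar confidence score.
    `dim` and `temperature` are illustrative assumptions."""
    assert dim % 2 == 0
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (temperature ** (2 * i / dim))
        angle = 2 * math.pi * score * freq
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def build_proposal_queries(detections, shared_query):
    """Turn detector outputs (cx, cy, w, h, score) into (anchor, query) pairs.

    The anchor keeps the full box, so width and height reach the decoder.
    The query is the shared vector plus the encoded score (an assumption)."""
    proposals = []
    for cx, cy, w, h, score in detections:
        anchor = (cx, cy, w, h)
        score_emb = sincos_encode(score, dim=len(shared_query))
        query = [q + e for q, e in zip(shared_query, score_emb)]
        proposals.append((anchor, query))
    return proposals


# Two hypothetical YOLOX detections in normalized coordinates.
detections = [(0.5, 0.5, 0.2, 0.4, 0.9), (0.1, 0.1, 0.05, 0.1, 0.3)]
proposals = build_proposal_queries(detections, shared_query=[0.0] * 8)
for anchor, query in proposals:
    print(anchor, [round(v, 3) for v in query])
```

Each detection yields one proposal, so the number of proposal queries follows the detector output, and two detections with different scores get different query vectors.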

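Q6 How are the experiments in the paper designed? (Mainly see Section 4, Experiments)

Datasets: We use the DanceTrack, MOT17 and BDD100K datasets to evaluate our approach.
The YOLOX detector is trained on 8 Tesla V100 GPUs for 16 epochs. The authors use it as a proposal generator that supplies proposals to MOTR.
The MOTR implementation is based on the official repository, with a ResNet50 backbone for feature extraction.

Q7 Which datasets are used for quantitative evaluation? Is the code open source?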

Datasets: We use the DanceTrack [27], MOT17 [15, 21] and BDD100K [42] datasets to evaluate our approach.
The code is open source: https://github.com/megvii-research/MOTRv2

Q8 Do the experiments and results support the hypothesis well?

Figure: Performance comparison between MOTR (grey bar) and MOTRv2 (orange bar) on the DanceTrack and BDD100K datasets. MOTRv2 improves the performance of MOTR by a large margin under different scenarios.

As well as:
4.3. State-of-the-art Comparison on DanceTrack
4.4. State-of-the-art Comparison on BDD100K
4.5. Comparison on the MOTChallenge

Q9 What does this paper actually contribute?

MOTRv2 breaks through the common belief that end-to-end frameworks are not suitable for high-performance MOT and explains why previous end-to-end MOT frameworks have failed. We hope it can provide some new insights on end-to-end MOT for the community.
In short: MOTR and similar end-to-end frameworks used to perform poorly, and MOTRv2 raises the performance on MOT data.

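Q10 What comes next? What work can be pursued further? (See Limitations)

Although using the YOLOX proposals greatly ease the optimization problem of MOTR, the proposed method is still data-hungry and does not perform well enough on smaller datasets.
So the small-dataset problem is one direction for further optimization.
Possible solution: data augmentation, such as rotating the original image by a small angle or adding random noise.
Pedestrian overlap: when two people overlap, the track query of one of them stops, as if it had been merged into the other person's trajectory.
This observation could serve as a valuable hint for potential enhancements in the future.
Efficiency: the bottleneck is the speed of MOTR.
The YOLOX [11] detector runs at 25 FPS while MOTR runs at 9.5 FPS on 2080Ti. Adding these two components yields a speed of 6.9 FPS.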
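
The 6.9 FPS figure follows from the two per-stage numbers if the stages are assumed to run one after the other on each frame, so that their per-frame times add up. A short check of that arithmetic (the function name is hypothetical):

```python
def combined_fps(*stage_fps):
    """Throughput when every frame passes through all stages sequentially:
    per-frame times (1 / FPS) add up, and the result is inverted again."""
    total_seconds_per_frame = sum(1.0 / fps for fps in stage_fps)
    return 1.0 / total_seconds_per_frame


yolox_fps, motr_fps = 25.0, 9.5
overall = combined_fps(yolox_fps, motr_fps)
print(round(overall, 1))  # 6.9
```

YOLOX takes 40 ms per frame and MOTR about 105 ms, so MOTR accounts for roughly 72% of the total time, which is why the notes call it the bottleneck.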