0 Preface
Paper: DiffusionDet: Diffusion Model for Object Detection
Code: https://github.com/ShoufaChen/DiffusionDet
1 Abstract
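The authors propose DiffusionDet. During training, the forward (diffusion) process starts from the ground-truth boxes and adds noise to them, and the model learns the reverse process of denoising. At inference, the model progressively refines a set of randomly generated boxes into the output.
The authors report two findings:
- Random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates.
- Object detection can be solved by a generative model.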
2 Motivation
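- How can object detection be done without heuristic object priors or learnable queries?
- Conventional image diffusion models denoise a noisy image into a clean, semantically meaningful one. By analogy, for object detection: given an image covered with a large number of random boxes (the analogue of added noise), can the model remove the spurious boxes (the analogue of denoising) and leave an image with the correct boxes (the analogue of the clean, semantically meaningful image)?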
3 Model

3.1 Object Detection
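The learning objective of object detection is defined over input-target pairs $(\mathbf{x},\mathbf{b},\mathbf{c})$, where $\mathbf{x}$ is the image, and $\mathbf{b}$ and $\mathbf{c}$ are the bounding boxes and category labels of the objects in it. The $i$-th box is $\mathbf{b}^i = (c_x^i,c_y^i,w^i,h^i)$, where $(c_x^i,c_y^i)$ is the center of the bounding box and $(w^i,h^i)$ are its width and height.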
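To make the box parameterization concrete, here is a minimal sketch of converting between the center format $(c_x, c_y, w, h)$ used above and the corner format $(x_{min}, y_{min}, x_{max}, y_{max})$. The function names and the `Box` alias are hypothetical and chosen only for illustration; they are not taken from the DiffusionDet code.

```python
from typing import Tuple

# Hypothetical alias for this sketch: a box is a 4-tuple of floats.
Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box) -> Box:
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_cxcywh(box: Box) -> Box:
    """Convert (x_min, y_min, x_max, y_max) back to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)
```

The center format is convenient for the diffusion process because all four coordinates can be perturbed independently with noise, while the corner format is what cropping and IoU computations usually expect.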
3.2 Diffusion model
Diffusion models are not covered in detail here; readers unfamiliar with them can consult the relevant literature. Only the parts relevant to DiffusionDet are noted.
$$L_{train} = \frac{1}{2}\|f_\theta(\mathbf{z}_t,t)-\mathbf{z}_0\|^2$$
Here $f_\theta$ is the network trained to predict $\mathbf{z}_0$ from the noisy $\mathbf{z}_t$ at step $t$. In DiffusionDet, $\mathbf{z}_0=\mathbf{b}$, with $\mathbf{b}\in\mathbb{R}^{N\times4}$.
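As a rough illustration of how this loss is used, the sketch below noises a set of boxes with the standard forward process $\mathbf{z}_t = \sqrt{\bar\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon}\sim\mathcal{N}(0, I)$, and evaluates $L_{train}$ for a given prediction. It uses plain Python lists to stay self-contained. The names `q_sample`, `train_loss` and `alpha_bar_t` are assumptions made for this example, and details of the actual implementation, such as the noise schedule and any rescaling of the box coordinates, are omitted.

```python
import math
import random
from typing import List

Boxes = List[List[float]]  # N rows of (cx, cy, w, h)


def q_sample(z0: Boxes, alpha_bar_t: float, rng: random.Random) -> Boxes:
    """Forward process: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in z0]


def train_loss(pred_z0: Boxes, z0: Boxes) -> float:
    """L_train = 1/2 * ||f_theta(z_t, t) - z_0||^2, summed over all coordinates."""
    return 0.5 * sum(
        (p - v) ** 2
        for pred_row, row in zip(pred_z0, z0)
        for p, v in zip(pred_row, row)
    )


# Usage: noise two ground-truth boxes, then score a (here trivial) prediction.
rng = random.Random(0)
z0 = [[0.5, 0.5, 0.2, 0.4], [0.3, 0.7, 0.1, 0.1]]
z_t = q_sample(z0, alpha_bar_t=0.9, rng=rng)
loss = train_loss(z_t, z0)  # stands in for f_theta(z_t, t); a real model would do better
```

With `alpha_bar_t` close to 1 the boxes are almost unchanged, and with `alpha_bar_t` close to 0 they are almost pure Gaussian noise, which is the state inference starts from.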
3.3 Architecture
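A major pain point of diffusion models is that their iterative computation makes training and inference expensive. Applying $f_\theta(\mathbf{z}_t,t)$ directly to the raw image at every step would be too costly, so the authors adopt an encoder-decoder architecture: the image encoder runs only once per image, and only the detection decoder is run at each refinement step.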
- Encoder
Backbone: ResNet, or Transformer-based models like Swin.
- Detection decoder
Just like Sparse R-CNN.
3.4 Training
- Ground truth padding.
Pad the original ground-truth boxes with extra boxes so that the total number of boxes is a fixed number $N_{train}$.
- Box corruption.
- Training losses.
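The ground-truth padding step can be sketched as follows. This is only an illustration under stated assumptions: the filler boxes are drawn from a Gaussian centered in the normalized image (mean 0.5, standard deviation 1/6, with width and height clipped to stay positive), and images with more than $N_{train}$ objects are randomly subsampled. Both choices, and the names `random_box` and `pad_ground_truth`, are hypothetical and are not claimed to match the authors' code.

```python
import random
from typing import List

Boxes = List[List[float]]  # rows of (cx, cy, w, h), normalized to the image size


def random_box(rng: random.Random) -> List[float]:
    """Draw one filler box (assumed distribution: Gaussian around the image center)."""
    cx, cy, w, h = (rng.gauss(0.5, 1.0 / 6.0) for _ in range(4))
    # Width and height must stay positive to describe a valid box.
    return [cx, cy, max(w, 1e-4), max(h, 1e-4)]


def pad_ground_truth(gt: Boxes, n_train: int, rng: random.Random) -> Boxes:
    """Return exactly n_train boxes: the ground truth plus random filler boxes."""
    if len(gt) >= n_train:
        # More objects than slots: keep a random subset (assumption for this sketch).
        return [list(b) for b in rng.sample(gt, n_train)]
    filler = [random_box(rng) for _ in range(n_train - len(gt))]
    return [list(b) for b in gt] + filler
```

Fixing the number of boxes lets every training image be batched with the same tensor shape, regardless of how many objects it contains.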
3.5 Inference
- Sampling step.
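The boxes from the previous step are fed to the detection decoder, which predicts their categories and coordinates; DDIM is then used to estimate the boxes for the next step.
- Box renewal.
The boxes predicted at each step fall into two types: desired and undesired predictions. Desired predictions should be kept. Undesired predictions are arbitrarily placed, but they are arbitrary outputs of the model, not the random Gaussian noise produced by the diffusion process.
To address this, the authors propose box renewal: (1) filter out the undesired boxes (those with scores lower than a particular threshold); (2) concatenate new boxes sampled from a Gaussian distribution to the remaining ones.
- Once-for-all.
Once the model is trained, the number of boxes and the number of sampling steps can be changed at inference without retraining.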
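The sampling step and box renewal described above can be sketched as two small functions. `ddim_step` is the standard deterministic DDIM update (with $\eta = 0$) computed from a predicted $\mathbf{z}_0$, and `box_renewal` drops low-scoring boxes and refills the set with Gaussian samples. The function names, the argument layout, and the way the two steps are combined are assumptions for illustration; the decoder that produces `z0_pred` and `scores` is not shown.

```python
import math
import random
from typing import List

Boxes = List[List[float]]  # rows of 4 box coordinates in the noisy space


def ddim_step(z_t: Boxes, z0_pred: Boxes,
              alpha_bar_t: float, alpha_bar_next: float) -> Boxes:
    """Deterministic DDIM update (eta = 0). Requires alpha_bar_t < 1."""
    out = []
    for zt_row, z0_row in zip(z_t, z0_pred):
        row = []
        for zt, z0 in zip(zt_row, z0_row):
            # Noise implied by the current sample and the predicted clean box.
            eps = (zt - math.sqrt(alpha_bar_t) * z0) / math.sqrt(1.0 - alpha_bar_t)
            # Re-noise the prediction to the (lower) noise level of the next step.
            row.append(math.sqrt(alpha_bar_next) * z0
                       + math.sqrt(1.0 - alpha_bar_next) * eps)
        out.append(row)
    return out


def box_renewal(boxes: Boxes, scores: List[float], threshold: float,
                n_boxes: int, rng: random.Random) -> Boxes:
    """Keep boxes scoring at least `threshold`; refill to n_boxes with Gaussian boxes."""
    kept = [b for b, s in zip(boxes, scores) if s >= threshold]
    fresh = [[rng.gauss(0.0, 1.0) for _ in range(4)]
             for _ in range(n_boxes - len(kept))]
    return kept + fresh
```

Because `n_boxes` and the list of noise levels are plain arguments rather than learned quantities, the same trained weights can be run with a different number of boxes or sampling steps, which is the once-for-all property.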
4 Properties
DiffusionDet can achieve better accuracy by using more boxes and/or more refining steps, at the cost of higher latency.
- Dynamic boxes. Increasing the number of boxes improves accuracy but increases the cost.
- Progressive refinement. Increasing the number of iterations improves accuracy but increases the cost.
5 Conclusion
DiffusionDet is presented as the first work to apply a diffusion model to object detection. Its noise-to-box pipeline has several appealing properties, including dynamic boxes and progressive refinement, which make it possible to use the same network parameters to obtain the desired speed-accuracy trade-off without re-training the model.