《LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search》
CVPR-2021
1 Background and Motivation
Deep-learning-based single-object tracking has advanced rapidly, but trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications.
This paper applies NAS to automatically search for a network architecture better suited to single-object tracking, LightTrack. It clearly improves accuracy while reducing both FLOPs and parameter count.
2 Related Work
- Object Tracking
  - more precise box estimation (anchor-based or anchor-free)
  - more powerful backbones
  - online update (ATOM / DiMP / ROAM)
- Neural Architecture Search
  - e.g. relaxing the search space to be continuous, so that the search can be optimized by efficient gradient descent
3 Advantages / Contributions
The tracker is divided into a backbone and a head.
The template and search branches form a Siamese structure; their features are aggregated by a correlation operation.
For more details, see 【SiamRPN】《High Performance Visual Tracking With Siamese Region Proposal Network》
(1) Use one-shot NAS to automatically search a network architecture better suited to single-object tracking: LightTrack
(2) Design a lightweight search space and a dedicated search pipeline for object tracking
4 Method
4.1 Preliminaries on One-Shot NAS
《Single Path One-Shot Neural Architecture Search with Uniform Sampling》(ECCV-2020)
(a) To reduce coupling between weights, every choice block must pick exactly one of its choices when sampled — a uniform sampling strategy with no identity mapping
(b) A weight-sharing choice block is proposed: pre-allocate weights for the maximum channel count, then randomly sample a channel number and slice out the corresponding weights for training. With this weight-sharing strategy, the supernet was found to converge quickly.
(From the blog post 【神经网络搜索】Single Path One Shot)
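The weight-sharing idea in (b) can be sketched in a few lines. This is a toy illustration, not the authors' code; the class name and shapes are my own:

```python
import random

class SlimmableConv:
    """Weight-sharing choice block (toy sketch): weights are pre-allocated
    for the maximum channel count, and at each iteration a channel number
    is sampled uniformly and the matching slice of the shared weights is
    used for training."""

    def __init__(self, max_channels, channel_choices):
        # One shared bank of per-channel kernels, sized for the maximum width.
        self.weight = [[0.0] * 9 for _ in range(max_channels)]
        self.channel_choices = channel_choices

    def sample(self, rng):
        k = rng.choice(self.channel_choices)  # uniform channel sampling
        return self.weight[:k]                # the slice reuses the stored kernels

conv = SlimmableConv(max_channels=256, channel_choices=[128, 192, 256])
w = conv.sample(random.Random(0))
assert len(w) in (128, 192, 256)
assert w[0] is conv.weight[0]  # same kernel objects -> weights are shared
```

Because every width reuses a slice of the same storage, training any sampled width updates the shared weights, which is why the supernet converges quickly.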
The formal notation:
$\mathcal{N}$ denotes the supernet
$\mathcal{A}$ is the architecture search space — the authors build it from the depthwise separable convolutions of MobileNetV1 (【MobileNet】《MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications》) and the inverted residual structure of MobileNetV2 (【MobileNet V2】《MobileNetV2: Inverted Residuals and Linear Bottlenecks》)
$W$ denotes the weights of the network
subnets $\alpha \in \mathcal{A}$ are paths within $\mathcal{N}$
Only the weights of the single supernet $\mathcal{N}$ need to be trained
single-path uniform sampling strategy
Only the supernet is trained; each subnet is a sub-architecture uniformly sampled from the supernet,
in each batch, only one random path is sampled for feedforward and backward propagation, while other paths are frozen
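The single-path sampling strategy above can be sketched as follows (hypothetical op names and loop, not the paper's code):

```python
import random

# Toy supernet: 4 choice blocks, each holding 3 candidate ops.
# Single-path uniform sampling: each batch, every block picks exactly one
# op (no identity mapping), only that path gets the forward/backward pass,
# and all other ops stay frozen.
SUPERNET = [["mb3x3_e4", "mb5x5_e4", "mb7x7_e6"] for _ in range(4)]

def sample_single_path(supernet, rng):
    # Uniform sampling: exactly one choice per block.
    return [rng.choice(block) for block in supernet]

rng = random.Random(0)
for step in range(3):  # one sampled path per batch
    path = sample_single_path(SUPERNET, rng)
    # forward/backward would run on `path` only; the rest is untouched
    assert len(path) == len(SUPERNET)
    assert all(op in block for op, block in zip(path, SUPERNET))
```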
4.2 Tracking via One-Shot NAS
Challenges of applying NAS to single-object tracking:
(1) The backbone must be pre-trained on ImageNet and also fine-tuned on tracking data — later experiments show that skipping the ImageNet pre-training hurts performance badly
(2) The searched architecture must cover both backbone feature extraction and head-based target localization
(3) The search space needs to include compact and low-latency building blocks.
The technical route the authors choose is one-shot NAS.
backbone supernet is pre-trained on ImageNet then fine-tuned with tracking data
Subscript b stands for backbone
Subscript p stands for pre-training
using tracking accuracy and model complexity as the supervision guidance
For evaluation, no retraining is needed: simply keep sampling subnets and testing them on the validation set.
h denotes the head structure; the objective is to minimize the loss at training time and maximize validation-set accuracy at evaluation time.
Note that the space of candidate subnets is still huge, so the authors use evolutionary algorithms to cut down the search — the technique adopted in one-shot NAS.
The NAS constraints are model size and FLOPs.
4.2.1 Search Space
MBConv has 6 choices: kernel sizes {3, 5, 7} × expansion rates {4, 6}, i.e. 3 × 2 = 6.
The backbone search space is $6^{2+4+4+4} = 6^{14} \approx 7.8 \times 10^{10}$, which is easy to follow (14 searchable blocks, 6 choices each).
The head search space is given as $(3 \times 3^8)^2 \approx 3.9 \times 10^{8}$, with kernel sizes {3, 5} and channel numbers {128, 192, 256}.
The first DSConv has 6 choices (2 kernels × 3 channel widths); each following DSConv reuses the first block's channel width, so it has 2 kernel choices plus a skip connection, i.e. 3 choices.
The outer square is easy to understand: one branch for classification, one for regression.
I suspect the paper made a typo here; the count should be $(6 \times 3^7)^2 = (2 \times 3^8)^2$.
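The counts above are easy to verify numerically:

```python
# Checking the search-space sizes discussed above.
backbone = 6 ** (2 + 4 + 4 + 4)            # 14 blocks, 6 choices each
assert backbone == 6 ** 14 == 78364164096  # ≈ 7.8e10, matches the paper

head_paper = (3 * 3 ** 8) ** 2  # the figure printed in the paper, ≈ 3.9e8
head_fixed = (6 * 3 ** 7) ** 2  # 1 block with 6 choices + 7 blocks with 3
assert head_fixed == (2 * 3 ** 8) ** 2  # same product, just regrouped
assert head_fixed != head_paper         # hence the suspected typo (≈ 1.7e8)
```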
4.2.2 Search Pipeline
Phase 1: Pre-training Backbone Supernet
Phase 2: Training Tracking Supernet
In each iteration, the optimizer updates one random path sampled from the backbone and head supernets.
cross-entropy loss for foreground-background classification and the IoU loss for object bounding-box regression.
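Both losses are standard; a minimal pure-Python sketch of each (my own toy implementation, not the paper's code):

```python
import math

def bce(p, y):
    """Cross-entropy for foreground/background classification (toy version)."""
    eps = 1e-7
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def iou_loss(pred, target):
    """1 - IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    return 1.0 - inter / (area_p + area_t - inter)

assert iou_loss((0, 0, 2, 2), (0, 0, 2, 2)) == 0.0  # perfect overlap
assert iou_loss((0, 0, 1, 1), (1, 1, 2, 2)) == 1.0  # no overlap
assert bce(0.9, 1) < bce(0.5, 1)  # confident correct prediction costs less
```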
Phase 3: Searching with Evolutionary Algorithm
The top-k architectures are picked as parents to generate child networks
For crossover, two randomly selected candidates are crossed to produce a new one.
For mutation, a randomly selected candidate mutates its every choice block with probability 0.1 to produce a new candidate
Using an evolutionary (genetic) algorithm to shrink the subnet search space really inspired me — crossover and mutation.
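The evolutionary step can be sketched as follows. Only the per-block mutation probability 0.1 and the 14-block/6-choice backbone shape come from the paper; the population size, child counts, and toy fitness function are my assumptions:

```python
import random

rng = random.Random(42)
NUM_BLOCKS, NUM_CHOICES = 14, 6  # backbone supernet shape from the paper

def crossover(a, b):
    # Each choice block inherits from one of the two parents at random.
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(arch, p=0.1):
    # Every choice block mutates independently with probability 0.1.
    return [rng.randrange(NUM_CHOICES) if rng.random() < p else g for g in arch]

def evolve(population, fitness, top_k=10, n_children=20):
    # Top-k architectures become parents; children come from crossover + mutation.
    parents = sorted(population, key=fitness, reverse=True)[:top_k]
    children = [crossover(*rng.sample(parents, 2)) for _ in range(n_children // 2)]
    children += [mutate(rng.choice(parents)) for _ in range(n_children - n_children // 2)]
    return parents + children

pop = [[rng.randrange(NUM_CHOICES) for _ in range(NUM_BLOCKS)] for _ in range(50)]
new_pop = evolve(pop, fitness=sum)  # toy fitness; the real one is val accuracy under FLOPs/size constraints
assert len(new_pop) == 30
assert all(len(a) == NUM_BLOCKS for a in new_pop)
```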
5 Experiments
1)Search
The training takes 30 epochs, and each epoch uses $6 \times 10^5$ image pairs.
The first 5 epochs use warm-up, with the learning rate rising from 0.01 to 0.03; afterwards it is gradually lowered to 0.0001.
For the first 10 epochs only the head is trained, with the backbone frozen; after that, the backbone's learning rate is set to 1/10 of the head's, i.e. relatively small.
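The schedule reads roughly like this; the exponential decay shape after warm-up is my assumption, since the text only says the rate is gradually lowered:

```python
def lr_at(epoch, total=30, warmup=5, lr_start=0.01, lr_peak=0.03, lr_end=1e-4):
    """Linear warm-up from 0.01 to 0.03 over the first 5 epochs,
    then decay toward 1e-4 over the remaining epochs (decay curve assumed)."""
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / warmup
    frac = (epoch - warmup) / (total - warmup)
    return lr_peak * (lr_end / lr_peak) ** frac

assert lr_at(0) == 0.01          # warm-up start
assert lr_at(5) == 0.03          # warm-up peak
assert abs(lr_at(30) - 1e-4) < 1e-9  # final learning rate
```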
2)Retrain
Performed after NAS finishes: 500 epochs on ImageNet
fine-tune the discovered backbone and head networks on the tracking data
crazy
3)Test
Standard inference pipeline
5.1 Datasets and Metrics
- Youtube-BB
- ImageNet VID
- ImageNet DET
- COCO
- GOT-10K
The metric is EAO (Expected Average Overlap)
5.2 Results and Comparisons
The paper doesn't seem to say which tracking data is used for fine-tuning; let's look at the results on each benchmark.
(1)VOT-19
Small yet powerful.
(2)GOT-10K
(3)TrackingNet
(4)LaSOT
(5)Speed
How does the speed compare — different trackers benchmarked across different smartphones.
5.3 Ablation and Analysis
NAS beats hand-crafted architectures.
ImageNet pre-training works much better than training from scratch.
Visualization of the full searched architecture.
The authors' observations on this architecture:
(1) About 50% of the backbone blocks use MBConv with a 7×7 kernel, indicating the importance of a large receptive field for precise localization in single-object tracking
(2) The searched architecture chooses the second-last block as the feature output layer — hard to tell from the figure — suggesting the search branch might not prefer high-level features
(3)The classification branch contains fewer layers than the regression branch
6 Conclusion(own) / Future work
- With a Siamese backbone, the two branches' depths could be designed differently — and so could their widths
- The cls and reg heads can have different structures
- Warm-up + training the head first
- Use one-shot NAS for one-shot detection
- A genetic algorithm to shrink the subnet sampling space 👍
- Applying NAS to tracking is an application-level innovation; easy to say, but actually getting it tuned to work in practice is impressive