《LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search》
CVPR-2021
1 Background and Motivation
Deep-learning-based single-object tracking has advanced rapidly, but trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications.
This paper applies NAS to automatically search for a network architecture better suited to single-object tracking, LightTrack. It clearly improves accuracy while reducing both FLOPs and parameter count.
2 Related Work
- Object Tracking
  - more precise box estimation (anchor-based or anchor-free)
  - more powerful backbones
  - online update (ATOM / DiMP / ROAM)
- Neural Architecture Search
  - e.g. relaxing the search space to be continuous, so that the search can be optimized by efficient gradient descent
3 Advantages / Contributions
The tracker is divided into a backbone and a head.
The template and search branches form a Siamese structure; their features are aggregated by a correlation operation.
For more details, see 【SiamRPN】《High Performance Visual Tracking With Siamese Region Proposal Network》
(1) Use one-shot NAS to automatically search a network architecture better suited to single-object tracking: LightTrack
(2) Design a lightweight search space and a dedicated search pipeline for object tracking
4 Method
4.1 Preliminaries on One-Shot NAS
《Single Path One-Shot Neural Architecture Search with Uniform Sampling》(ECCV-2020)
(a) To reduce coupling between weights, every choice block must pick exactly one of its choices when sampled — a uniform sampling strategy with no identity mapping
(b) A weight-sharing choice block is proposed: pre-allocate weights for the maximum channel count, then randomly sample a channel number and slice out the corresponding weights for training. With this weight-sharing strategy, the supernet was found to converge quickly.
(From the blog post 【神经网络搜索】Single Path One Shot)
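The weight-sharing idea in (b) can be sketched in a few lines. This is a toy illustration, not the authors' code; the class name and shapes are my own:

```python
import random

class SlimmableConv:
    """Weight-sharing choice block (toy sketch): weights are pre-allocated
    for the maximum channel count, and at each iteration a channel number
    is sampled uniformly and the matching slice of the shared weights is
    used for training."""

    def __init__(self, max_channels, channel_choices):
        # One shared bank of per-channel kernels, sized for the maximum width.
        self.weight = [[0.0] * 9 for _ in range(max_channels)]
        self.channel_choices = channel_choices

    def sample(self, rng):
        k = rng.choice(self.channel_choices)  # uniform channel sampling
        return self.weight[:k]                # the slice reuses the stored kernels

conv = SlimmableConv(max_channels=256, channel_choices=[128, 192, 256])
w = conv.sample(random.Random(0))
assert len(w) in (128, 192, 256)
assert w[0] is conv.weight[0]  # same kernel objects -> weights are shared
```

Because every width reuses a slice of the same storage, training any sampled width updates the shared weights, which is why the supernet converges quickly.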
The formal notation:
$\mathcal{N}$ denotes the supernet
$\mathcal{A}$ is the architecture search space — the authors build it from the depthwise separable convolutions of MobileNetV1 (【MobileNet】《MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications》) and the inverted residual structure of MobileNetV2 (【MobileNet V2】《MobileNetV2: Inverted Residuals and Linear Bottlenecks》)
$W$ denotes the weights of the network
subnets $\alpha \in \mathcal{A}$ are paths within $\mathcal{N}$
Only the weights of the single supernet $\mathcal{N}$ need to be trained
single-path uniform sampling strategy
Only the supernet is trained; each subnet is a sub-architecture uniformly sampled from the supernet,
in each batch, only one random path is sampled for feedforward and backward propagation, while other paths are frozen
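The single-path sampling strategy above can be sketched as follows (hypothetical op names and loop, not the paper's code):

```python
import random

# Toy supernet: 4 choice blocks, each holding 3 candidate ops.
# Single-path uniform sampling: each batch, every block picks exactly one
# op (no identity mapping), only that path gets the forward/backward pass,
# and all other ops stay frozen.
SUPERNET = [["mb3x3_e4", "mb5x5_e4", "mb7x7_e6"] for _ in range(4)]

def sample_single_path(supernet, rng):
    # Uniform sampling: exactly one choice per block.
    return [rng.choice(block) for block in supernet]

rng = random.Random(0)
for step in range(3):  # one sampled path per batch
    path = sample_single_path(SUPERNET, rng)
    # forward/backward would run on `path` only; the rest is untouched
    assert len(path) == len(SUPERNET)
    assert all(op in block for op, block in zip(path, SUPERNET))
```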
4.2 Tracking via One-Shot NAS
Challenges of applying NAS to single-object tracking:
(1) The backbone must be pre-trained on ImageNet and also fine-tuned on tracking data — later experiments show that skipping the ImageNet pre-training hurts performance badly
(2) The searched architecture must cover both backbone feature extraction and head-based target localization
(3) The search space needs to include compact and low-latency building blocks.
The technical route the authors choose is one-shot NAS.
backbone supernet is pre-trained on ImageNet then fine-tuned with tracking data
Subscript b stands for backbone
Subscript p stands for pre-training
using tracking accuracy and model complexity as the supervision guidance
For evaluation, no retraining is needed: simply keep sampling subnets and testing them on the validation set.
h denotes the head structure; the objective is to minimize the loss at training time and maximize validation-set accuracy at evaluation time.
Note that the space of candidate subnets is still huge, so the authors use evolutionary algorithms to cut down the search — the technique adopted in one-shot NAS.
The NAS constraints are model size and FLOPs.
4.2.1 Search Space
MBConv has 6 choices: kernel sizes {3, 5, 7} × expansion rates {4, 6}, i.e. 3 × 2 = 6.
The backbone search space is $6^{2+4+4+4} = 6^{14} \approx 7.8 \times 10^{10}$, which is easy to follow (14 searchable blocks, 6 choices each).
The head search space is given as $(3 \times 3^8)^2 \approx 3.9 \times 10^{8}$, with kernel sizes {3, 5} and channel numbers {128, 192, 256}.
The first DSConv has 6 choices (2 kernels × 3 channel widths); each following DSConv reuses the first block's channel width, so it has 2 kernel choices plus a skip connection, i.e. 3 choices.
The outer square is easy to understand: one branch for classification, one for regression.
I suspect the paper made a typo here; the count should be $(6 \times 3^7)^2 = (2 \times 3^8)^2$.
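The counts above are easy to verify numerically:

```python
# Checking the search-space sizes discussed above.
backbone = 6 ** (2 + 4 + 4 + 4)            # 14 blocks, 6 choices each
assert backbone == 6 ** 14 == 78364164096  # ≈ 7.8e10, matches the paper

head_paper = (3 * 3 ** 8) ** 2  # the figure printed in the paper, ≈ 3.9e8
head_fixed = (6 * 3 ** 7) ** 2  # 1 block with 6 choices + 7 blocks with 3
assert head_fixed == (2 * 3 ** 8) ** 2  # same product, just regrouped
assert head_fixed != head_paper         # hence the suspected typo (≈ 1.7e8)
```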
4.2.2 Search Pipeline
Phase 1: Pre-training Backbone Supernet
Phase 2: Training Tracking Supernet
In each iteration, the optimizer updates one random path sampled from the backbone and head supernets.
cross-entropy loss for foreground-background classification and the IoU loss for object bounding-box regression.
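Both losses are standard; a minimal pure-Python sketch of each (my own toy implementation, not the paper's code):

```python
import math

def bce(p, y):
    """Cross-entropy for foreground/background classification (toy version)."""
    eps = 1e-7
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def iou_loss(pred, target):
    """1 - IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    return 1.0 - inter / (area_p + area_t - inter)

assert iou_loss((0, 0, 2, 2), (0, 0, 2, 2)) == 0.0  # perfect overlap
assert iou_loss((0, 0, 1, 1), (1, 1, 2, 2)) == 1.0  # no overlap
assert bce(0.9, 1) < bce(0.5, 1)  # confident correct prediction costs less
```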
Phase 3: Searching with Evolutionary Algorithm
The top-k architectures are picked as parents to generate child networks
For crossover, two randomly selected candidates are crossed to produce a new one.
For mutation, a randomly selected candidate mutates its every choice block with probability 0.1 to produce a new candidate
Using an evolutionary (genetic) algorithm to shrink the subnet search space really inspired me — crossover and mutation.
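The evolutionary step can be sketched as follows. Only the per-block mutation probability 0.1 and the 14-block/6-choice backbone shape come from the paper; the population size, child counts, and toy fitness function are my assumptions:

```python
import random

rng = random.Random(42)
NUM_BLOCKS, NUM_CHOICES = 14, 6  # backbone supernet shape from the paper

def crossover(a, b):
    # Each choice block inherits from one of the two parents at random.
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(arch, p=0.1):
    # Every choice block mutates independently with probability 0.1.
    return [rng.randrange(NUM_CHOICES) if rng.random() < p else g for g in arch]

def evolve(population, fitness, top_k=10, n_children=20):
    # Top-k architectures become parents; children come from crossover + mutation.
    parents = sorted(population, key=fitness, reverse=True)[:top_k]
    children = [crossover(*rng.sample(parents, 2)) for _ in range(n_children // 2)]
    children += [mutate(rng.choice(parents)) for _ in range(n_children - n_children // 2)]
    return parents + children

pop = [[rng.randrange(NUM_CHOICES) for _ in range(NUM_BLOCKS)] for _ in range(50)]
new_pop = evolve(pop, fitness=sum)  # toy fitness; the real one is val accuracy under FLOPs/size constraints
assert len(new_pop) == 30
assert all(len(a) == NUM_BLOCKS for a in new_pop)
```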
5 Experiments
1)Search
The training takes 30 epochs, and each epoch uses $6 \times 10^5$ image pairs.
The first 5 epochs use warm-up, with the learning rate rising from 0.01 to 0.03; afterwards it is gradually lowered to 0.0001.
For the first 10 epochs only the head is trained, with the backbone frozen; after that, the backbone's learning rate is set to 1/10 of the head's, i.e. relatively small.
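The schedule reads roughly like this; the exponential decay shape after warm-up is my assumption, since the text only says the rate is gradually lowered:

```python
def lr_at(epoch, total=30, warmup=5, lr_start=0.01, lr_peak=0.03, lr_end=1e-4):
    """Linear warm-up from 0.01 to 0.03 over the first 5 epochs,
    then decay toward 1e-4 over the remaining epochs (decay curve assumed)."""
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / warmup
    frac = (epoch - warmup) / (total - warmup)
    return lr_peak * (lr_end / lr_peak) ** frac

assert lr_at(0) == 0.01          # warm-up start
assert lr_at(5) == 0.03          # warm-up peak
assert abs(lr_at(30) - 1e-4) < 1e-9  # final learning rate
```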
2)Retrain
Performed after NAS finishes: 500 epochs on ImageNet
fine-tune the discovered backbone and head networks on the tracking data
crazy
3)Test
Standard inference pipeline
5.1 Datasets and Metrics
- Youtube-BB
- ImageNet VID
- ImageNet DET
- COCO
- GOT-10K
The metric is EAO (Expected Average Overlap)
5.2 Results and Comparisons
The paper doesn't seem to say which tracking data is used for fine-tuning; let's look at the results on each benchmark.
(1)VOT-19
Small yet powerful.
(2)GOT-10K
(3)TrackingNet
(4)LaSOT
(5)Speed
How does the speed compare — different trackers benchmarked across different smartphones.
5.3 Ablation and Analysis
NAS beats hand-crafted architectures.
ImageNet pre-training works much better than training from scratch.
Visualization of the full searched architecture.
The authors' observations on this architecture:
(1) About 50% of the backbone blocks use MBConv with a 7×7 kernel, indicating the importance of a large receptive field for precise localization in single-object tracking
(2) The searched architecture chooses the second-last block as the feature output layer — hard to tell from the figure — suggesting the search branch might not prefer high-level features
(3)The classification branch contains fewer layers than the regression branch
6 Conclusion(own) / Future work
- With a Siamese backbone, the two branches' depths could be designed differently — and so could their widths
- The cls and reg heads can have different structures
- Warm-up + training the head first
- Use one-shot NAS for one-shot detection
- A genetic algorithm to shrink the subnet sampling space 👍
- Applying NAS to tracking is an application-level innovation; easy to say, but actually getting it tuned to work in practice is impressive