【论文翻译】X3D: Expanding Architectures for Efficient Video Recognition

CSPhD-winston-杨帆

文章标签：计算机视觉、机器学习、人工智能、视频检测、行为理解

本文链接：https://blog.csdn.net/WhiffeYF/article/details/112800201

参考

X3D: Expanding Architectures for Efficient Video Recognition个人论文笔记

X3D: Expanding Architectures for Efficient Video Recognition

X3D：面向高效视频识别的扩展架构

Abstract

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.

本文介绍了X3D，一组高效的视频网络，它沿着多个网络轴，在空间、时间、宽度和深度上逐步扩展微小的二维图像分类体系结构。

Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved.

受机器学习中特征选择方法的启发，我们采用了一种简单的逐步网络扩展方法，每一步只扩展一个轴，从而实现精度与复杂度之间的良好权衡。

To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction.

为了将X3D扩展到特定的目标复杂性，我们执行渐进的向前扩展，然后向后收缩。

X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work.

X3D实现了最先进的性能，同时在精度相近的情况下，所需的乘加运算量和参数量分别约为以往工作的1/4.8和1/5.5。

Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters.

我们最令人惊讶的发现是，具有高时空分辨率的网络可以表现得很好，但在网络宽度和参数方面却非常轻。

We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https://github.com/facebookresearch/SlowFast.

我们报告了在视频分类和检测基准上具有空前效率的竞争性准确性。代码可以在https://github.com/facebookresearch/SlowFast找到。

1. Introduction

Neural networks for video recognition have been largely driven by expanding 2D image architectures [29, 47, 64, 71] into spacetime.

用于视频识别的神经网络在很大程度上是通过将2D图像架构[29,47,64,71]扩展到时空来驱动的。

Naturally, these expansions often happen along the temporal axis, involving extending the network inputs, features, and/or filter kernels into spacetime (e.g. [7, 13, 17, 42, 56, 75]); other design decisions—including depth (number of layers), width (number of channels), and spatial sizes—however, are typically inherited from 2D image architectures.

自然地，这些扩展通常沿着时间轴发生，包括将网络输入、特征和/或滤波器内核扩展到时空(例如[7,13,17,42,56,75]);然而，其他设计决策——包括深度(层数)、宽度(通道数)和空间大小——通常都继承自2D图像架构。

While expanding along the temporal axis (while keeping other design properties) generally increases accuracy, it can be sub-optimal if one takes into account the computation/accuracy trade-off—a consideration of central importance in applications.

虽然沿时间轴扩展(同时保持其他设计属性不变)通常会提高精度，但如果考虑计算量与精度之间的权衡(这在实际应用中至关重要)，这种做法可能并非最优。

In part because of the direct extension of 2D models to 3D, video recognition architectures are computationally heavy. In comparison to image recognition, typical video models are significantly more compute-demanding, e.g. an image ResNet [29] can use around 27× fewer multiply-add operations than a temporally extended video variant [81].

部分由于直接将2D模型扩展为3D模型，视频识别架构的计算量很大。与图像识别相比，典型的视频模型对计算的需求明显更高，例如，图像ResNet[29]所用的乘加运算量大约只有其在时间维度上扩展的视频变体[81]的1/27。

Figure 1. X3D networks progressively expand a 2D network across the following axes: Temporal duration $γ_t$ , frame rate $γ_τ$ , spatial resolution $γ_s$ , width $γ_w$ , bottleneck width $γ_b$ , and depth $γ_d$ .

图1. X3D网络沿以下轴逐步扩展2D网络：时间长度 $γ_t$ ，帧率 $γ_τ$ ，空间分辨率 $γ_s$ ，宽度 $γ_w$ ，瓶颈宽度 $γ_b$ 和深度 $γ_d$ 。

This paper focuses on the low-computation regime in terms of computation/accuracy trade-off for video recognition.

本文在视频识别的计算/精度权衡方面重点研究了低计算机制。

We base our design upon the “mobile-regime” models [31, 32, 61] developed for image recognition.

我们的设计基于为图像识别而开发的“移动端级别”(mobile-regime)模型[31,32,61]。

Our core idea is that while expanding a small model along the temporal axis can increase accuracy, the computation/accuracy trade-off may not always be best compared with expanding other axes, especially in the low-computation regime where accuracy can increase quickly along different axes.

我们的核心思想是，虽然沿时间轴扩展一个小模型可以提高精度，但与扩展其他轴相比，计算/精度的权衡可能并不总是最好的，特别是在低计算的情况下，精度可以沿不同轴快速提高。

In this paper, we progressively “expand” a tiny base 2D image architecture into a spatiotemporal one by expanding multiple possible axes shown in Fig. 1. The candidate axes are temporal duration $γ_t$ , frame rate $γ_τ$ , spatial resolution $γ_s$ , network width $γ_w$ , bottleneck width $γ_b$ , and depth $γ_d$ .

在本文中，我们通过扩展图1中所示的多个可能轴，逐步地将微小的二维图像结构“扩展”成一个时空结构。候选轴为时间持续 $γ_t$ 、帧速率 $γ_τ$ 、空间分辨率 $γ_s$ 、网络宽度 $γ_w$ 、瓶颈宽度 $γ_b$ 和深度 $γ_d$ 。

The resulting architecture is referred to as X3D (Expand 3D) for expanding from the 2D space into the 3D spacetime domain.

由此产生的体系结构称为X3D(扩展3D)，用于从2D空间扩展到3D时空域。

The 2D base architecture is driven by the MobileNet [31, 32, 61] core concept of channel-wise separable convolutions, but is made tiny by having over 10× fewer multiply-add operations than mobile image models.

2D基础架构借鉴了MobileNet[31,32,61]的核心概念，即逐通道(channel-wise)可分离卷积，但其乘加运算量比移动端图像模型少10倍以上，因而非常微小。
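To make the saving from channel-wise separable convolutions concrete, the following sketch compares multiply-add counts for a standard 3D convolution and for a channel-wise (depthwise) 3D convolution followed by a 1×1×1 pointwise convolution. The function names and the layer sizes are illustrative assumptions, not values taken from the paper, and bias terms and strides are ignored.

```python
def conv3d_madds(positions: int, c_in: int, c_out: int, kernel_volume: int) -> int:
    """Multiply-adds of a standard 3D convolution.

    positions     : number of output locations (T * H * W)
    kernel_volume : k_t * k_h * k_w
    """
    return positions * c_in * c_out * kernel_volume


def separable_conv3d_madds(positions: int, c_in: int, c_out: int, kernel_volume: int) -> int:
    """Multiply-adds of a channel-wise conv followed by a 1x1x1 pointwise conv."""
    channel_wise = positions * c_in * kernel_volume  # one filter per input channel
    pointwise = positions * c_in * c_out             # 1x1x1 channel mixing
    return channel_wise + pointwise


# Hypothetical layer: 16 frames of 56x56 features, 64 -> 64 channels, 3x3x3 kernel.
positions = 16 * 56 * 56
standard = conv3d_madds(positions, 64, 64, 27)
separable = separable_conv3d_madds(positions, 64, 64, 27)
ratio = standard / separable
print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.2f}")
```

For this assumed layer the separable variant needs roughly 19× fewer multiply-adds, which illustrates why the base network can be made so small.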

Our expansion then progressively increases the computation (e.g., by 2×) by expanding only one axis at a time, training and validating the resultant architecture, and selecting the axis that achieves the best computation/accuracy trade-off.

然后，我们的扩展逐步增加计算(例如，2×)，每次只扩展一个轴，训练和验证结果架构，并选择实现最佳计算/精度权衡的轴。

The process is repeated until the architecture reaches a desired computational budget.

重复这个过程，直到体系结构达到所需的计算预算。

This can be interpreted as a form of coordinate descent [83] in the hyper-parameter space defined by those axes.

这可以解释为这些轴定义的超参数空间中的坐标下降形式[83]。
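The stepwise procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_cost`, `toy_score`, and `TOY_WEIGHTS` are made-up stand-ins for measuring multiply-adds and for training and validating each candidate, and every expansion step is assumed to double the cost along exactly one axis.

```python
import math
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]

# The six expansion axes named in the text.
AXES = ["gamma_t", "gamma_tau", "gamma_s", "gamma_w", "gamma_b", "gamma_d"]


def expand_one_axis(cfg: Config, axis: str, factor: float = 2.0) -> Config:
    """Return a copy of cfg with a single axis scaled by `factor`."""
    new_cfg = dict(cfg)
    new_cfg[axis] *= factor
    return new_cfg


def stepwise_expansion(
    base: Config,
    cost_fn: Callable[[Config], float],
    score_fn: Callable[[Config], float],
    budget: float,
) -> Tuple[Config, List[str]]:
    """Expand one axis per step, keeping the best-scoring candidate each time."""
    current = dict(base)
    chosen: List[str] = []
    while cost_fn(current) < budget:
        # One candidate per axis; each changes only that axis.
        candidates = [(axis, expand_one_axis(current, axis)) for axis in AXES]
        best_axis, best_cfg = max(candidates, key=lambda pair: score_fn(pair[1]))
        current = best_cfg
        chosen.append(best_axis)
    return current, chosen


def toy_cost(cfg: Config) -> float:
    """Hypothetical cost: the product of all expansion factors."""
    return math.prod(cfg.values())


# Hypothetical per-axis benefit weights (illustration only).
TOY_WEIGHTS = {"gamma_t": 3.0, "gamma_tau": 1.0, "gamma_s": 2.5,
               "gamma_w": 1.0, "gamma_b": 1.0, "gamma_d": 1.0}


def toy_score(cfg: Config) -> float:
    """Hypothetical accuracy proxy with diminishing returns per axis."""
    return sum(TOY_WEIGHTS[a] * math.sqrt(math.log2(v)) for a, v in cfg.items())


base_cfg = {axis: 1.0 for axis in AXES}
final_cfg, chosen_axes = stepwise_expansion(base_cfg, toy_cost, toy_score, budget=8.0)
print(chosen_axes, final_cfg)
```

Because each step changes one coordinate and keeps the best result, the loop resembles coordinate descent over the axis space; in a real setting the score function would be replaced by actual training and validation runs.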

Our progressive network expansion approach is inspired by the history of image ConvNet design, where popular architectures have arisen by expansions across depth [8, 29, 47, 64, 71, 94], resolution [35, 70, 73] or width [88, 93], and by classical feature selection methods [25, 41, 44] in machine learning.

我们的渐进式网络扩展方法受到两方面的启发：一是图像卷积网络的设计历史，其中流行的架构通过在深度[8,29,47,64,71,94]、分辨率[35,70,73]或宽度[88,93]上的扩展而产生；二是机器学习中的经典特征选择方法[25,41,44]。

In the latter, progressive feature selection methods [25, 44] either start with a minimal set of features and greedily search for relevant features by including a single feature in each step (forward selection), or start with a full set of features and prune irrelevant ones by repeatedly deleting the feature whose removal reduces performance the least (backward elimination).

在后者中，渐进式特征选择方法[25,44]要么从一个最小的特征集合开始，以贪心方式在每一步加入一个特征，以找出能够提升性能的相关特征(前向选择)；要么从完整的特征集合开始，反复删除对性能影响最小的特征，以剔除不相关的特征(后向消除)。
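The two classical strategies can be sketched in a few lines. This is a generic illustration of forward selection and backward elimination, not code from the paper; the additive `toy_utility` scoring function and the feature names are assumptions chosen only to keep the example deterministic.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[List[str]], float]


def forward_selection(features: Sequence[str], score: ScoreFn, k: int) -> List[str]:
    """Start empty; in each step add the single feature that helps the score most."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(features: Sequence[str], score: ScoreFn, k: int) -> List[str]:
    """Start with all features; repeatedly drop the one whose removal hurts least."""
    selected = list(features)
    while len(selected) > k:
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected


# Hypothetical per-feature usefulness; a real score would come from validation.
TOY_VALUES = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}


def toy_utility(subset: List[str]) -> float:
    return sum(TOY_VALUES[f] for f in subset)


forward_result = forward_selection(list(TOY_VALUES), toy_utility, k=2)
backward_result = backward_elimination(list(TOY_VALUES), toy_utility, k=2)
print(forward_result, backward_result)
```

X3D's forward expansion followed by backward contraction mirrors these two phases, with network axes playing the role of features.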

To compare to previous research, we use Kinetics-400 [43], Kinetics-600 [4], Charades [62] and AVA [24]. For systematic studies, we classify our models into different levels of complexity for small, medium and large models.

与之前的研究相比，我们使用了Kinetics-400[43]、Kinetics-600[4]、Charades[62]和AVA[24]。为了进行系统研究，我们将模型分为小型、中型和大型模型的不同复杂级别。

Overall, our expansion produces a sequence of spatiotemporal architectures, covering a wide range of computation/accuracy trade-offs.

总的来说，我们的扩展产生了一系列时空架构，涵盖了广泛的计算/精度权衡。

They can be used under different computational budgets that are application-dependent in practice.

它们可以在不同的计算预算下使用，这些计算预算在实践中依赖于应用程序。

For example, across different computation and accuracy regimes X3D performs favorably to state-of-the-art while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work.

例如，在不同的计算量和精度区间内，X3D的表现均不逊于现有最先进方法，同时在精度相近的情况下，所需的乘加运算量和参数量分别约为以往工作的1/4.8和1/5.5。

Further, expansion is simple and cheap e.g. our low-compute model is completed after only training 30 tiny models that accumulatively require over 25× fewer multiply-add operations for training than one large state-of-the-art network [15, 81, 84].

此外，扩展简单且廉价，例如，我们的低计算模型仅在训练30个小模型后就完成了，而训练这些小模型累积需要的乘加操作比一个大型最先进的网络少25倍[15,81,84]。

Conceptually, our most surprising finding is that very thin video architectures that are created by expanding spatiotemporal resolution perform well, while being light in terms of network width and parameters.

从概念上讲，我们最令人惊讶的发现是，通过扩展时空分辨率创建的非常薄的视频架构表现良好，而在网络宽度和参数方面较轻。

X3D networks have lower width than image-design [29, 64, 71] based video models, making X3D similar to the high-resolution Fast pathway [15] which has been designed in such fashion.

X3D网络的宽度小于基于图像设计[29,64,71]的视频模型，这使得X3D类似于以这种方式设计的高分辨率快速通道[15]。

We hope these advances will facilitate future research and applications.

我们希望这些进展将促进未来的研究和应用。
