Paper notes: Massively Parallel Video Networks

1. Introduction

  1. Pipelining schemes tailored to sequence models (called predictive depth-parallelism).
  2. Show how such architectures can be augmented with multi-rate clocks and how they benefit from skip connections.
  3. Show that it is possible to obtain better parallel models by distilling them from sequential ones.
  4. Explore other wiring patterns (temporal filters and feedback) that improve the expressivity of the resulting models.

2. Related work

  1. Rely on image models executed frame-by-frame, sped up by simplifying the models, using fewer parameters, or using low-bit representation formats.
  2. Propagate information between time steps. Prior work proposed periodically warping old activations given fresh external optical flow as input, rather than recomputing them. The authors' approach requires neither external inputs nor special warping modules; instead, it places the burden on learning.
  3. Treat the video as a volume by stacking the frames and applying 3D convolutions to extract spatio-temporal features. These models operate at a large temporal scale due to the use of larger temporal convolution strides at deeper layers, but they are not causal (they extract features from future frames), which makes them challenging to use in real time.
  4. Hierarchical architectures with clocks, attaching a possibly different clock rate to each module, yielding temporally multi-scale models that scale better to long sequences. The clock rates can be hard-coded or learnt from data.
  5. Reduce latency: if the available time runs out before the data has traversed the entire network, emergency exits are used to output whatever predictions have been computed thus far.
  6. Pipelining strategies, at training and at inference time.

3. Efficient online video models

[Figure 1 from the paper: depth-sequential (a) vs. depth-parallel (b) video models]

Depth-parallel networks

In basic depth-sequential video models, the input to each layer is the output of the previous layer at the same time step, and the network outputs a prediction only after all the layers have processed the current frame in sequence; see fig. 1 (a).

In the proposed design, every layer in the network processes its input, passes the activations to the next layer, and immediately starts processing the next input available, without waiting for the whole network to finish computation for the current frame; fig. 1 (b).
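A minimal sketch of the two update rules (toy layers and names assumed here, not the authors' code): in the sequential step every layer processes the current frame before a prediction comes out, while in the parallel step layer l consumes the activation that layer l-1 produced at the previous time step.

```python
def sequential_step(layers, frame):
    # fig. 1 (a): all layers process the current frame in sequence,
    # so the prediction latency grows with network depth.
    x = frame
    for layer in layers:
        x = layer(x)
    return x

def parallel_step(layers, frame, state):
    # fig. 1 (b): layer l consumes the activation that layer l-1
    # produced at the previous time step, so all layers can run
    # concurrently on different frames.
    inputs = [frame] + state[:-1]        # layer l reads state[l-1]
    new_state = [layer(x) for layer, x in zip(layers, inputs)]
    return new_state[-1], new_state      # prediction, cached activations

layers = [lambda x: x + 1 for _ in range(4)]   # toy stand-ins for conv layers
state = [0] * len(layers)                      # activations from the previous step
for frame in range(10):
    prediction, state = parallel_step(layers, frame, state)
```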

Latency and throughput

Prediction latency

For the sequential model, throughput is roughly the inverse of the computational latency, hence the deeper the model, the higher the computational latency and the lower the throughput.

A key property of the proposed depth-parallel models: irrespective of depth, the model can now make predictions at the rate of its slowest layer.
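As a toy illustration with assumed numbers (not from the paper): with four layers taking 10 ms each, the sequential model emits a prediction roughly every 40 ms (about 25 fps), whereas the depth-parallel model can emit one roughly every 10 ms (about 100 fps), the rate of its slowest layer.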

Information latency

The number of frames it takes before the input signal reaches the output layer along the network’s shortest path

Whenever the prediction latency is smaller than the information latency, the network must make a prediction for an input that it has not yet fully processed.

The higher the information latency, the more challenging it is to operate with prediction latency of zero

Employ temporal skip connections to minimise the information latency of the different layers in the network
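Under this definition, the information latency can be computed as the length of the shortest path from the input to the output layer, if each edge in the depth-parallel pipeline is assumed to add one frame of delay. A toy sketch (graph encoding and names are assumptions for illustration):

```python
from collections import deque

def information_latency(num_layers, skips=()):
    """Length of the shortest path (in frames of delay) from the input
    (node 0) to the deepest layer, assuming each pipeline edge adds one
    frame of delay; `skips` are extra (src, dst) edges such as temporal
    skip connections."""
    edges = {i: [i + 1] for i in range(num_layers)}    # plain layer chain
    for src, dst in skips:
        edges.setdefault(src, []).append(dst)
    dist = {0: 0}
    queue = deque([0])
    while queue:                                       # BFS: first visit is shortest
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist[num_layers]

print(information_latency(8))                   # chain of 8 layers -> 8 frames of delay
print(information_latency(8, skips=[(0, 4)]))   # skip from the input to layer 4 -> 5 frames
```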

Pipelined operations and temporal receptive field

[Figure 2 from the paper: temporal receptive fields under pipelined operation]

  1. Symmetric triangular shape
  2. Skewed triangular shape
  3. A skewed, shifted triangular shape: along the network depth the temporal receptive field is skewed, so shallower layers have access to frames that the deeper layers cannot yet see (information latency). For example, as shown in fig. 2 (c), the most recent frame the deepest layer can see at time t = 0 is frame $I_{-4}$, assuming temporal kernels of size 3; since the prediction latency is defined to be zero, this layer must effectively predict the output 4 frames ahead. Adding temporal skip connections reduces the information latency; in the extreme case, the receptive field becomes similar to the causal one and the information latency drops to zero.

Levels of parallelism

An image model with a linear-chain layer architecture → a semi-parallel video model

Traverse the network starting from the first layer, and group together contiguous layers into sequential blocks of k layers that we will call parallel subnetworks and which can execute independently

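A minimal sketch of this grouping (toy layers and names assumed): contiguous layers are chunked into blocks of k, layers inside a block run sequentially on the same input, and across blocks the one-frame pipeline delay of fig. 1 (b) applies. Setting k equal to the depth recovers the fully sequential model, while k = 1 gives the fully depth-parallel one.

```python
def make_subnetworks(layers, k):
    # group contiguous layers into blocks of k ("parallel subnetworks")
    return [layers[i:i + k] for i in range(0, len(layers), k)]

def run_block(block, x):
    # inside a block, layers execute sequentially on the same input
    for layer in block:
        x = layer(x)
    return x

def semi_parallel_step(blocks, frame, state):
    # block b consumes what block b-1 produced at the previous time step
    inputs = [frame] + state[:-1]
    new_state = [run_block(block, x) for block, x in zip(blocks, inputs)]
    return new_state[-1], new_state

blocks = make_subnetworks([lambda x: x + 1 for _ in range(8)], k=2)
state = [0] * len(blocks)
for frame in range(5):
    prediction, state = semi_parallel_step(blocks, frame, state)
```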

3.1 Multi-rate clocks

Fast-varying observations can be explained by slow-varying latent factors

This can be implemented by having multi-rate clocks: whenever the clock of a layer does not tick, that layer does not compute activations, instead it reuses the existing ones.

3D ConvNets implement this principle by using temporal strides, but they do not keep state and hence cannot efficiently operate frame-by-frame.

The authors’ recurrent setting:

  1. Multi-rate clocks can be implemented by removing nodes from the unrolled graph and preserving an internal state to cache outputs until the next slower-ticking layer can consume them.
  2. Used a set of fixed rates in our models, typically reducing clock rates by a factor of two whenever spatial resolution is halved.
    Instead of just using identity to create the internal state as we did, one could use any spatial recurrent module

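A minimal sketch of this clocked caching (class and parameter names are assumptions): when a layer's clock does not tick, it returns its cached activations instead of recomputing them, i.e. an identity internal state as in point 2 above.

```python
class ClockedLayer:
    """Recompute only when the clock ticks; otherwise reuse the cached
    activations (an identity internal state)."""

    def __init__(self, layer_fn, rate):
        self.layer_fn = layer_fn   # the wrapped layer
        self.rate = rate           # tick once every `rate` time steps
        self.cache = None          # internal state holding the last output

    def __call__(self, x, t):
        if self.cache is None or t % self.rate == 0:
            self.cache = self.layer_fn(x)   # clock ticks: recompute
        return self.cache                   # clock silent: reuse activations

# Fixed schedule from the notes: halve the clock rate whenever the spatial
# resolution is halved, e.g. rates 1, 2, 4, 8 for four blocks.
clocked = [ClockedLayer(lambda x: x + 1, rate=2 ** i) for i in range(4)]
```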

3.2 Temporal filters and feedback

  1. One way to make learning easier is to use units with temporal filters.
  2. The use of temporal filters is illustrated in fig. 4 (b) as temporalisation. Interestingly, depth-parallelisation by itself also induces temporalisation in models with skip connections.

    [Figure 4 from the paper: temporalisation and feedback wiring patterns]
  3. A feedback connection – the outputs of the previous frame are fed as inputs to the early layers of the network.
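A toy sketch of both wiring patterns (buffer length, kernel weights and names are assumptions): a temporalised unit keeps a rolling buffer of its last k inputs and mixes them with a temporal kernel, and a feedback step feeds the previous frame's output back into the early part of the network.

```python
from collections import deque

class TemporalFilterUnit:
    """Toy temporalised unit: mixes its last k inputs with a temporal
    kernel (a weighted sum standing in for a temporal convolution)."""

    def __init__(self, weights):
        self.weights = list(weights)                      # temporal kernel, length k
        self.buffer = deque([0.0] * len(self.weights),
                            maxlen=len(self.weights))     # last k activations

    def __call__(self, x):
        self.buffer.append(x)                             # drop the oldest activation
        return sum(w * a for w, a in zip(self.weights, self.buffer))

def step_with_feedback(early, late, frame, prev_output):
    # feedback: the previous frame's output is fused with the current
    # frame before it enters the early layers (toy additive fusion).
    return late(early(frame + prev_output))

unit = TemporalFilterUnit(weights=[0.2, 0.3, 0.5])
prev = 0.0
for frame in range(5):
    prev = step_with_feedback(unit, lambda x: 2 * x, float(frame), prev)
```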

3.3 Sequential-to-parallel “distillation”

  1. Reducing latency means the computational depth per frame is reduced, and the model must re-use features from previous states through the multi-rate clocks mechanism.
  2. In [44], [45], a teacher network is privileged relative to a student network, either due to having greater capacity or (in the case of Ladder networks) due to having access to greater amounts of information.

  3. Consider the sequential model as the teacher, since all of its layers always have access to fresh features extracted from the current frame.
  1. First train a causal fully-sequential model with the same overall architecture as the parallel model.
  2. Modify the loss of the parallel model to encourage its activations to match those of the sequential model for some given layers

  3. Used the average of this new loss over m = 3 layers

$$L_d = L(y_a, y_{gt}) + \lambda \sum_{i=1}^{m} \frac{1}{n_i} \left\lVert \hat{a}^{(i)} - a^{(i)} \right\rVert$$

  1. $L(y_a, y_{gt})$ is the initial cross-entropy loss between the predictions of the parallel network $y_a$ and the ground truth $y_{gt}$.
  2. The second term is the normalised Euclidean distance between the activations $\hat{a}^{(i)}$ of the pre-trained sequential model for layer $i$ and the activations $a^{(i)}$ of the parallel model for the same layer; $n_i$ denotes the number of feature channels of layer $i$.
  3. $\lambda$ weights the two components of the new loss; it is set to $\lambda = 1$ for dense keypoint prediction and $\lambda = 100$ for action recognition.
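A minimal PyTorch-style sketch of this loss (function and variable names assumed; the exact normalisation used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(parallel_logits, targets, parallel_acts, teacher_acts, lam=1.0):
    """L_d = L(y_a, y_gt) + lambda * sum_i ||a_hat_i - a_i|| / n_i, summed
    over the m matched layers (m = 3 in the notes).  `teacher_acts` come
    from the pre-trained causal fully-sequential model and are not updated;
    lam is 1 for dense keypoint prediction and 100 for action recognition."""
    task_loss = F.cross_entropy(parallel_logits, targets)      # L(y_a, y_gt)
    distill = 0.0
    for a_par, a_teach in zip(parallel_acts, teacher_acts):
        n_i = a_teach.shape[1]                                 # feature channels of layer i
        distill = distill + torch.norm(a_teach.detach() - a_par) / n_i
    return task_loss + lam * distill
```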