Action Recognition paper-reading

TSN(2016)
long-range temporal structure modeling

two-stream ConvNets + a sequence of short snippets + four types of input modalities

The spatial stream ConvNet operates on a single RGB image, and the temporal stream ConvNet takes a stack of consecutive optical flow fields as input.

Instead of working on single frames or frame stacks, TSN operates on a sequence of short snippets (K segments of equal duration) sparsely sampled from the entire video. Each snippet in this sequence produces its own preliminary prediction of the action classes, and a consensus among the snippets is then derived as the video-level prediction.

The segmental consensus function combines the outputs from multiple short snippets to obtain a consensus of class hypotheses among them; the experiments consider evenly averaging, maximum, and weighted averaging.
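
A minimal PyTorch sketch of the sampling-plus-consensus pipeline, assuming a generic 2D `backbone` that maps a batch of frames to class scores (all names here are illustrative, not the authors' code):

```python
import torch

def tsn_forward(backbone, video, num_segments=3, consensus="avg"):
    """video: (T, C, H, W); sample one snippet (here a single frame) per segment."""
    T = video.shape[0]
    # Sparse sampling: K segments of equal duration, one frame from each
    # (training samples a random position inside each segment; uniform here).
    idx = torch.linspace(0, T - 1, num_segments).round().long()
    scores = backbone(video[idx])            # (K, num_classes), per-snippet predictions
    # Segmental consensus over the K snippet-level predictions.
    if consensus == "avg":                   # evenly averaging
        return scores.mean(dim=0)
    if consensus == "max":                   # maximum
        return scores.max(dim=0).values
    raise ValueError(f"unknown consensus: {consensus}")
```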

a single RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field.
RGB difference between two consecutive frames describes the appearance change, which may correspond to the motion-salient region.
The warped optical flow is extracted by first estimating the homography matrix and then compensating for camera motion; this suppresses the background motion and makes the motion concentrate on the actor.
Optical flow is better at capturing motion information, while RGB difference may sometimes be unstable for describing motion; RGB difference may serve as a low-quality, high-speed alternative motion representation.
Best: Optical Flow + Warped Flow + RGB (for the fusion, take a weighted average with weights 1 : 0.5 : 1).
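
For the cheapest modality above, the stacked RGB difference is just the element-wise difference of consecutive frames (a trivial sketch):

```python
import torch

def rgb_difference_stack(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) -> (T-1, C, H, W) stack of consecutive differences."""
    return frames[1:] - frames[:-1]
```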

A cross-modality pre-training technique plus two new data augmentation techniques: corner cropping (extracted regions are selected only from the corners or the center of the image) and scale jittering (the size of the input image or optical flow fields is fixed at 256×340, the width and height of the cropped region are randomly selected from {256, 224, 192, 168}, and the crop is resized to 224×224 for network training).
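
A sketch of the two augmentations under the stated parameters (the helper names are mine, not TSN's code):

```python
import random

CROP_SIDES = [256, 224, 192, 168]   # candidate crop side lengths on a 256x340 input

def scale_jittered_crop_size():
    # Scale jittering: crop width and height are drawn independently.
    return random.choice(CROP_SIDES), random.choice(CROP_SIDES)

def corner_crop_positions(img_h, img_w, crop_h, crop_w):
    # Corner cropping: crops come only from the four corners and the center,
    # never arbitrary positions; the chosen crop is then resized to 224x224.
    return [(0, 0), (0, img_w - crop_w),
            (img_h - crop_h, 0), (img_h - crop_h, img_w - crop_w),
            ((img_h - crop_h) // 2, (img_w - crop_w) // 2)]
```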

I3D(2017)
Two-Stream Inflated 3D ConvNet
Simply convert successful image (2D) classification models into 3D ConvNets: start with a 2D architecture and inflate all the filters and pooling kernels, endowing them with an additional temporal dimension. Filters go from N×N to N×N×N by repeating the weights of the 2D filters N times along the time dimension and rescaling them by dividing by N.
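
A minimal sketch of the inflation rule (the tensor layout assumes PyTorch's (out, in, kH, kW) convention):

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """(out_c, in_c, k, k) -> (out_c, in_c, t, k, k): repeat along time, divide by t.

    Dividing by t makes the inflated network reproduce the 2D network's
    activations on a "boring" video made of a repeated still frame.
    """
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
```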

R(2+1)D(2018)
two new forms of spatiotemporal convolution.
mixed convolution (MC): employ 3D convolutions only in the early layers of the network, with 2D convolutions in the top layers. The intuition is that motion modeling is a low/mid-level operation that can be implemented via 3D convolutions in the early layers of a network, with spatial reasoning over these mid-level motion features implemented by 2D convolutions in the top layers;
R(2+1)D: replace the N_i 3D convolutional filters of size N_{i−1} × t × d × d with a (2+1)D block consisting of M_i 2D convolutional filters of size N_{i−1} × 1 × d × d and N_i temporal convolutional filters of size M_i × t × 1 × 1. The hyperparameter M_i determines the dimensionality of the intermediate subspace where the signal is projected between the spatial and temporal convolutions.

Two advantages: it increases the number of nonlinearities (an extra ReLU between the spatial and temporal convolutions), and it makes optimization easier.
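
A minimal PyTorch sketch of one (2+1)D block. The paper chooses M_i so that the block's parameter count matches the full t×d×d 3D convolution (M_i = ⌊t d² N_{i−1} N_i / (d² N_{i−1} + t N_i)⌋); here M_i is simply an argument:

```python
import torch.nn as nn

class Conv2Plus1D(nn.Sequential):
    """Factor a t x d x d 3D conv into a spatial 1 x d x d and a temporal t x 1 x 1 conv."""
    def __init__(self, in_c, out_c, mid_c, t=3, d=3):
        super().__init__(
            # M_i spatial filters of size N_{i-1} x 1 x d x d
            nn.Conv3d(in_c, mid_c, (1, d, d), padding=(0, d // 2, d // 2), bias=False),
            nn.BatchNorm3d(mid_c),
            nn.ReLU(inplace=True),   # the extra nonlinearity gained by the factorization
            # N_i temporal filters of size M_i x t x 1 x 1
            nn.Conv3d(mid_c, out_c, (t, 1, 1), padding=(t // 2, 0, 0), bias=False),
        )
```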

Non-Local(2018)
non-local operations as a generic family of building blocks for capturing long-range dependencies.

Following the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of the features at all positions.
The set of positions can be in space, time, or spacetime, implying that these operations are applicable to image, sequence, and video problems.
In videos, long-range interactions occur between distant pixels in space as well as time.

Non-local matching is also the essence of successful texture synthesis, super-resolution, and inpainting algorithms.
A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space.
Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies.

A non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used at the end.

generic non-local operation in deep neural networks

g(x): for simplicity, g is taken as a linear embedding, g(x_j) = W_g x_j, where W_g is a weight matrix to be learned. This is implemented as, e.g., a 1×1 convolution in space or a 1×1×1 convolution in spacetime.
f(x_i, x_j), four variants:
- Gaussian
- Embedded Gaussian
- Dot product
- Concatenation ([·, ·] denotes concatenation)
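
Written out, the generic operation and the pairwise functions above (as given in the paper; θ and φ are linear embeddings like g, and the normalization C(x) is Σ_j f(x_i, x_j) for the Gaussian variants and the number of positions N for the other two):

```latex
\[
  y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)
\]
\begin{align*}
  \text{Gaussian:}          \quad & f(x_i, x_j) = e^{x_i^{\top} x_j} \\
  \text{Embedded Gaussian:} \quad & f(x_i, x_j) = e^{\theta(x_i)^{\top} \phi(x_j)} \\
  \text{Dot product:}       \quad & f(x_i, x_j) = \theta(x_i)^{\top} \phi(x_j) \\
  \text{Concatenation:}     \quad & f(x_i, x_j) = \mathrm{ReLU}\!\left(w_f^{\top}\,[\theta(x_i), \phi(x_j)]\right)
\end{align*}
```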

non-local block

Thanks to the residual connection (z_i = W_z y_i + x_i, with W_z initialized to zero), a new non-local block can be inserted into any pre-trained model without breaking its initial behavior.
To make it more efficient: set the number of channels represented by W_g, W_θ, and W_φ to half the number of channels in x, and use a subsampling trick (e.g., subsample x by pooling).
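
A minimal sketch of an embedded-Gaussian non-local block over spacetime features of shape (B, C, T, H, W); the channel halving and zero-initialized output projection follow the paper's recipe, while the class name and other details are illustrative:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        c = channels // 2                      # bottleneck: half the channels of x
        self.theta = nn.Conv3d(channels, c, 1)
        self.phi = nn.Conv3d(channels, c, 1)
        self.g = nn.Conv3d(channels, c, 1)
        self.w_z = nn.Conv3d(c, channels, 1)
        nn.init.zeros_(self.w_z.weight)        # block starts as the identity, so it can
        nn.init.zeros_(self.w_z.bias)          # be inserted into a pre-trained model

    def forward(self, x):
        b, _, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (B, N, C/2), N = T*H*W
        phi = self.phi(x).flatten(2)                       # (B, C/2, N)
        g = self.g(x).flatten(2).transpose(1, 2)           # (B, N, C/2)
        attn = torch.softmax(theta @ phi, dim=-1)          # embedded Gaussian = softmax
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.w_z(y)                             # residual connection
```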

TSM(2019)

TSM shifts part of the channels along the temporal dimension, thereby facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero extra computation and zero extra parameters.

Shift only a small portion of the channels for efficient temporal fusion instead of shifting all the channels, to cut down the data movement cost and improve accuracy (performance peaks when 1/4 of the channels are shifted, i.e., 1/8 in each direction).
Insert TSM inside the residual branch rather than outside, so that the activation of the current frame is preserved and the spatial feature learning capability of the 2D CNN backbone is not harmed.
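
A minimal sketch of the shift itself, matching the 1/8-per-direction default (in the residual variant this is applied to the input of the residual branch):

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (B, T, C, H, W) -> same shape; 2/fold_div of the channels move in time."""
    fold = x.shape[2] // fold_div
    out = torch.zeros_like(x)                              # zero-pad the vacated slots
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # frame t receives channels from t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # frame t receives channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
    return out
```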

SlowFast(2019)

(i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can learn useful temporal information for video recognition.
The idea is to treat spatial structures and temporal events separately, exploring the potential of different temporal speeds.

Attach one lateral connection between the two pathways for every "stage". Specifically for ResNets, these connections are right after pool1, res2, res3, and res4.

Use non-degenerate temporal convolutions (temporal kernel size > 1) only in res4 and res5 of the Slow pathway; all filters from conv1 to res3 are essentially 2D convolution kernels in this pathway. This is motivated by the experimental observation that using temporal convolutions in earlier layers degrades accuracy.
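
A sketch of the two-rate sampling and one lateral connection, assuming the paper's default speed ratio α = 8 and channel ratio β = 1/8; the time-strided convolution is one of the fusion options reported, and the concrete sizes here are illustrative:

```python
import torch
import torch.nn as nn

alpha = 8        # the Fast pathway samples alpha x more frames than Slow
fast_c = 8       # Fast channels = beta * Slow channels, e.g. 8 vs. 64

def sample_pathways(video: torch.Tensor, tau: int = 16):
    """video: (B, C, T, H, W). Slow keeps every tau-th frame, Fast every (tau/alpha)-th."""
    return video[:, :, ::tau], video[:, :, ::tau // alpha]

# Lateral connection (Fast -> Slow): a time-strided 5x1x1 convolution with
# stride alpha matches the temporal lengths of the two pathways; its
# 2*beta*C output channels are concatenated to the Slow pathway's features.
lateral = nn.Conv3d(fast_c, 2 * fast_c, kernel_size=(5, 1, 1),
                    stride=(alpha, 1, 1), padding=(2, 0, 0))
```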

TIN(2020)

This work is built on the idea of fusing a single frame with neighboring frames in the channel dimension. Temporal fusion is accomplished through: 1. shifting groups of channels; 2. temporal attention, with the offsets and weights learned from the target tasks. To help information flow symmetrically in the temporal dimension, the strategy of reversed offsets is adopted.

Temporal information is fused along the temporal dimension by inserting the module before each convolutional layer in the residual block.

Three steps: 1. split the input feature channel-wise into several groups and obtain the offsets and weights of neighboring frames for mingling the temporal information; 2. apply the learned offsets to their respective groups through a shifting operation and interpolate the shifted feature along the temporal dimension; 3. concatenate the split features and aggregate them temporal-wise with the learned weights.

Deformable Shift Module:
First squeeze global spatial information into a temporal channel descriptor (3D average pooling).

OffsetNet: pooling → 1D conv to aggregate the channel info → 2 FC + ReLU to aggregate the temporal info; the raw offsets are rescaled to the range (−T/2, T/2).

WeightNet: 1D conv + sigmoid.
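
A sketch of OffsetNet and WeightNet as described above; the layer widths and the tanh rescaling are illustrative assumptions, not the authors' exact configuration, and the pooled descriptor comes from the 3D average pooling step:

```python
import torch.nn as nn

class OffsetNet(nn.Module):
    def __init__(self, channels, t, n_offsets):
        super().__init__()
        self.conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)  # aggregate channel info
        self.fc = nn.Sequential(nn.Linear(t, t), nn.ReLU(inplace=True),
                                nn.Linear(t, n_offsets))              # aggregate temporal info
        self.t = t

    def forward(self, desc):                # desc: (B, C, T) after spatial pooling
        x = self.conv(desc).squeeze(1)      # (B, T)
        raw = self.fc(x)                    # (B, n_offsets) raw offsets
        return raw.tanh() * (self.t / 2)    # rescale to (-T/2, T/2); tanh is an assumption

class WeightNet(nn.Module):
    def __init__(self, channels, n_groups):
        super().__init__()
        self.conv = nn.Conv1d(channels, n_groups, kernel_size=3, padding=1)

    def forward(self, desc):                # desc: (B, C, T)
        return self.conv(desc).sigmoid()    # (B, n_groups, T) temporal weights
```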
Differentiable Temporal-wise Frame Sampling: the input feature map U is split into two parts along the channel dimension; one part is shifted by different offsets according to its group, while the rest remains un-shifted.
With n groups, only the offsets of half the groups (n/2) are learned; the remaining half are derived symmetrically from the previous offsets.
When the split groups of channels are concatenated back into V, the feature map is multiplied by the weight E along the temporal dimension. (A sketch of the fractional shift follows below.)
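
A sketch of the differentiable fractional shift at the heart of step 2: a real-valued offset is realized as a linear interpolation between the two nearest integer shifts, so the gradient with respect to the offset is well defined (zero padding at the boundaries; the function names are mine):

```python
import torch

def integer_shift(x: torch.Tensor, s: int) -> torch.Tensor:
    """Shift x: (B, T, C, H, W) by s frames along T, zero-padding the gap."""
    if s == 0:
        return x
    out = torch.zeros_like(x)
    if s > 0:
        out[:, s:] = x[:, :-s]
    else:
        out[:, :s] = x[:, -s:]
    return out

def fractional_temporal_shift(x: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    """offset: scalar tensor in frames (may be fractional, learned by OffsetNet)."""
    s = int(torch.floor(offset).item())
    frac = offset - s            # gradient flows to the offset through frac
    return (1 - frac) * integer_shift(x, s) + frac * integer_shift(x, s + 1)
```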