CVPR 2019 Notes: Timeception for Complex Action Recognition

Timeception for Complex Action Recognition

Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders. Timeception for Complex Action Recognition. CVPR, 2019
While studying, never forget to keep asking yourself why.


Preface:

I have only skimmed this paper once; I still need another two days for a close reading (studying the details together with the code and the experiments). For now I have a rough grasp of the paper's goal, the purpose and reasoning behind the network structure, and how the paper is organized.

For me, the key takeaway from this paper is how to pose a problem and then try to find a way to solve it.
The problem posed should normally be common and reasonable. In this paper's case: in everyday life, an "action" is usually a long activity composed of several simple actions, yet no good method for handling such long activities had been proposed. (That is the paper's entry point.)

1. Problems and Solutions

After explaining the conceptual difference between the two kinds of actions, the paper states its problems from the perspectives of the model, the current datasets, and the difficulties of the task, and proposes a solution for each of them.
ordinary life: complex actions vs. one-actions

  • complex actions:

    • composed of several one-actions
    • large variations in temporal duration and temporal order
    • takes much longer to unfold
  • one-actions:

    • exhibit one visual pattern, possibly repetitive
    • usually short in time, homogeneous in motion and coherent in form
1.1 Problems

Model:

  • Related works use spatiotemporal 3D convolutions with fixed kernel sizes, which are too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling.

Data:

  • The main focus has been the recognition of short-range actions, as in HMDB, UCF, and Kinetics. Little attention has been paid to the recognition of long-range and complex actions.

Task:

  1. minute-long temporal modeling while maintaining attention to seconds-long details
  2. tolerating variations in temporal extent and temporal order of one-actions
1.2 Solutions

Model:

  • use multi-scale temporal convolutions, i.e. Timeception convolution layers

Data:

  • use Charades, Breakfast Actions, and MultiTHUMOS

Task:
present Timeception:

  1. learns long-range temporal dependencies with attention to short-range details (dedicated only for temporal modeling)
  2. it tolerates the differences in temporal extent of the one-actions comprising the complex action

2. Innovations

The paper summarizes three main innovations:

  1. introduce a convolutional temporal layer that effectively and efficiently learns minute-long action ranges of 1024 timesteps, a factor of 8 longer than the best related work
  2. introduce multi-scale temporal kernels to account for large variations in duration of action components
  3. use temporal-only convolutions, which are better suited for complex actions than spatiotemporal counterparts

3. Background

For background, the paper discusses four related areas, ordered from older to newer work and from loosely to closely related. Below is my summary of the four parts:
Temporal Modeling

  • downside: neglecting temporal patterns
  • Video vs. image
  • temporal dimension
  • statistical pooling
  • neural methods

Short-range Action Recognition

  • with shallow motion features
  • too computationally expensive
  • deep appearance features
  • frame-level, 2D
  • complement
  • evolve from 2D to 3D

Long-range Action Recognition

  • temporal pattern
  • learn video-wide representation
  • learns relations between several video segments
  • learns temporal structure
  • different temporal resolutions
  • self-attention

Convolution Decomposition

  • channel shuffling
  • to solve computational complexity
  • separable 2D convolution
  • separable 2+1D convolutions (sketched below)
  • 1x1 2D convolution
  • 3x3 2D spatial convolution
  • models cross-channel correlation
  • grouped convolutions
  • multi-scale 2D spatial kernels
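
As a quick reminder of what these decompositions look like in practice, here is a minimal PyTorch sketch of a (2+1)D separable convolution; this is my own illustration, not code from any of the cited works, and the channel counts and shapes are arbitrary:

```python
import torch
import torch.nn as nn

# A 3x3x3 spatiotemporal convolution replaced by a spatial-only 1x3x3
# convolution followed by a temporal-only 3x1x1 convolution.
separable_2plus1d = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),  # spatial
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),  # temporal
)

x = torch.randn(1, 64, 16, 56, 56)   # (batch, channels, T, H, W)
print(separable_2plus1d(x).shape)    # torch.Size([1, 64, 16, 56, 56])
```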

4. Structures

4.1 inspiration:

The spatiotemporal kernel is decomposable: $w \propto w_s \times w_t$
-> namely $\widetilde{w} = w_{\alpha} \times w_{\beta} \times w_{\gamma} \times \dots$
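
To see why decomposition frees up capacity for the temporal dimension, a rough parameter count helps; the kernel sizes and channel count below are my own illustrative choices, not numbers from the paper:

```latex
\begin{align*}
\text{full 3D kernel:}\quad & k_t \, k^2 \, C_{in} C_{out} = 3 \cdot 3^2 \cdot 1024^2 \approx 28.3\text{M} \\
\text{spatial} \times \text{temporal:}\quad & (k^2 + k_t)\, C_{in} C_{out} = (9 + 3) \cdot 1024^2 \approx 12.6\text{M} \\
\text{depthwise temporal only:}\quad & k_t \, C = 3 \cdot 1024 \quad \text{(one temporal filter per channel)}
\end{align*}
```

The cheaper the temporal kernel, the longer the temporal range the model can afford, which is the point of the further factorization of $\widetilde{w}$ above.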

4.2 three intuitive design principles:

Subspace Modularity
Subspace Balance
Subspace Efficiency

4.3 layer structure
  1. dependency
  2. long-range
  3. temporal extent (the length of a one-action can vary a lot)

The motivation behind this layer design is to gain temporal modeling capacity through decomposition.

  • The left part is the overall layer: a group convolution splits the input channel-wise into N groups while keeping the temporal dimension T unchanged, which reduces the overall complexity.
  • Within each group, the Temporal Conv Module is a depthwise convolution that acts only along the temporal dimension (one temporal filter per channel).
  • After the group outputs are concatenated, a channel-wise shuffle lets the layer learn cross-group correlations (the groups here are channel groups) and further helps reduce complexity (see the sketch below).
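
A minimal PyTorch sketch of that channel-shuffle step; this is my own illustration, not the authors' code, and the (N, C, T, H, W) layout and helper name are assumptions:

```python
import torch

def channel_shuffle(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped operation
    sees channels from every group (cross-group correlation)."""
    n, c, t, h, w = x.size()
    assert c % num_groups == 0
    # (N, G, C/G, T, H, W) -> swap the group and per-group channel axes -> flatten back
    x = x.view(n, num_groups, c // num_groups, t, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, t, h, w)

# e.g. 64 channels split into 8 groups of 8, interleaved back into 64
y = channel_shuffle(torch.randn(2, 64, 32, 7, 7), num_groups=8)
```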

The Temporal Conv Module on the right side:
Its overall layout resembles an Inception network; it is essentially an Inception-style layer applied along the temporal dimension.

  • It has two main stages. The first stage uses multi-scale kernels: convolving at several scales along T is what tolerates the different temporal extents.
  • The second stage seems to further reduce the channel dimension, so that the final channel dimension becomes 5/M of the original (see the sketch below).
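
Below is a sketch of such a per-group module in PyTorch, following my reading of the notes above; the exact branch layout, the kernel sizes (3, 5, 7), and the reduction factor M are assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class TemporalConvModule(nn.Module):
    """Inception-style block along the T axis for one channel group.
    Input/output layout: (N, C, T, H, W); only T is convolved."""

    def __init__(self, channels: int, reduction: int = 4, kernels=(3, 5, 7)):
        super().__init__()
        out_per_branch = channels // reduction      # the "M" reduction from the notes
        self.branches = nn.ModuleList()
        for k in kernels:                           # multi-scale depthwise temporal convs
            self.branches.append(nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                          padding=(k // 2, 0, 0), groups=channels, bias=False),
                nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False),
            ))
        # a temporal max-pool branch and a plain 1x1x1 branch
        self.branches.append(nn.Sequential(
            nn.MaxPool3d(kernel_size=(3, 1, 1), stride=1, padding=(1, 0, 0)),
            nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False),
        ))
        self.branches.append(nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # concatenating the 5 branches gives 5/M of the input channels, T unchanged
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

Concatenating the five branches keeps T unchanged while shrinking each group's channels to 5/M of the input, which matches the note above.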

The purpose is roughly twofold:

  1. to learn from long videos, gaining capacity along the T dimension through decomposition
  2. to tolerate varying temporal extents, by using multi-scale kernels

The final model consists of four Timeception layers stacked on top of the last convolution layer of a CNN (see the stacking sketch below).
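
Putting the pieces together, here is a sketch of one Timeception-style layer and a stack of four on top of backbone features, reusing channel_shuffle and TemporalConvModule from the sketches above. The group count, reduction factor, feature shapes, and the temporal max-pooling that halves T per layer are my assumptions / reading, not the official code:

```python
import torch
import torch.nn as nn

class TimeceptionLayer(nn.Module):
    """Channel groups -> per-group TemporalConvModule -> concat -> channel shuffle
    -> temporal max-pool (halves T)."""

    def __init__(self, channels: int, num_groups: int = 8, reduction: int = 4):
        super().__init__()
        assert channels % num_groups == 0
        self.num_groups = num_groups
        self.group_modules = nn.ModuleList(
            TemporalConvModule(channels // num_groups, reduction)
            for _ in range(num_groups))
        self.pool = nn.MaxPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, T, H, W)
        chunks = torch.chunk(x, self.num_groups, dim=1)
        x = torch.cat([m(c) for m, c in zip(self.group_modules, chunks)], dim=1)
        x = channel_shuffle(x, self.num_groups)
        return self.pool(x)                                # T -> T // 2

# Four layers stacked on the last-layer features of a CNN (shapes illustrative)
feats = torch.randn(1, 1024, 32, 7, 7)                     # (N, C, T, H, W)
layers, c = [], 1024
for _ in range(4):
    layers.append(TimeceptionLayer(c, num_groups=8, reduction=4))
    c = 8 * 5 * (c // 8 // 4)                              # channels grow by ~5/M per layer
print(nn.Sequential(*layers)(feats).shape)
```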

5. Code

github
This part should be studied further together with the experiments section.

6. What’s behind

  • Could I use this neural network / pipeline?
  • Could I learn the way the authors propose an idea?
  • Could I learn the way they solve a problem?
  • Could I learn their writing skills?

Coming up with a task --> finding a solution, plus the writing skill

6.1 Aim

intuitively feasible
I don't think this structure succeeded on the first attempt; there must have been repeated trial and error guided by these inferences.
Those inferences were probably built on earlier papers, and the most important new attempt on top of them was extending similar existing structures to the T dimension.

Goal: gain temporal modeling capacity through decomposition.

  1. to learn from long videos, gaining capacity along the T dimension through decomposition
  2. to tolerate varying temporal extents, by using multi-scale kernels
6.2 Writing style:
  • style: discussing while proposing (conventional method vs. what’s new)

  • structure:

    • Abstract: summarizes 1. focus, 2. downside, 3. innovation, 4. dataset
    • Introduction:
      1. phenomenon to bring “problem”
      2. propose “problems” while discussing and analysing
      3. novelty list
    • Background:
      • related work from different subsections.
      • subsection 1 … subsection 4
      • bring out what you have learned and been inspired by
    • Method:
      • motivation and inspiration
      • structure explanation
    • Experiments
    • Conclusion
6.3 Idea
  1. There are not many mathematical formulas supporting the idea.
    Try to use explanations to express what you are doing more clearly.