CVPR 2019 Notes: Timeception for Complex Action Recognition

Timeception for Complex Action Recognition

Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders. Timeception for Complex Action Recognition. CVPR, 2019
While studying, never forget to keep asking yourself why.


Preface:

I have only skimmed this paper once; I still need another two days for a close reading (studying the details together with the code and the experiments). For now I have a rough grasp of the paper's goal, the purpose and reasoning behind the network structure, and how the paper is organized.

For me, the key takeaway from this paper is how to pose a problem and then try to find a way to solve it.
The problem posed should normally be common and reasonable. In this paper's case: in everyday life, an "action" is usually a long activity composed of several simple actions, yet no good method for handling such long activities had been proposed. (That is the paper's entry point.)

1. Problems and Solutions

After explaining the conceptual difference between the two kinds of actions, the paper states its problems from the perspectives of the model, the current datasets, and the difficulties of the task, and proposes a solution for each of them.
ordinary life: complex actions vs. one-actions

  • complex actions:

    • composed of several one-actions
    • large variations in temporal duration and temporal order
    • takes much longer to unfold
  • one-actions:

    • exhibit one visual pattern, possibly repetitive
    • usually short in time, homogeneous in motion and coherent in form
1.1 Problems

Model:

  • Related works use spatiotemporal 3D convolutions with fixed kernel sizes, which are too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling.

Data:

  • The main focus has been the recognition of short-range actions, as in HMDB, UCF, and Kinetics. Little attention has been paid to the recognition of long-range and complex actions.

Task:

  1. minute-long temporal modeling while maintaining attention to seconds-long details
  2. tolerating variations in temporal extent and temporal order of one-actions
1.2 Solutions

Model:

  • use multi-scale temporal convolutions, i.e. Timeception convolution layers

Data:

  • use Charades, Breakfast Actions, and MultiTHUMOS

Task:
present Timeception:

  1. learns long-range temporal dependencies with attention to short-range details (dedicated only for temporal modeling)
  2. it tolerates the differences in temporal extent of the one-actions comprising the complex action

2. Innovations

The paper summarizes three main innovations:

  1. introduce a convolutional temporal layer that effectively and efficiently learns minute-long action ranges of 1024 timesteps, a factor of 8 longer than the best related work
  2. introduce multi-scale temporal kernels to account for large variations in duration of action components
  3. use temporal-only convolutions, which are better suited for complex actions than spatiotemporal counterparts

3. Background

For background, the paper discusses four related areas, ordered from older to newer work and from loosely to closely related. Below is my summary of the four parts:
Temporal Modeling

  • downside: neglecting temporal patterns
  • Video vs. image
  • temporal dimension
  • statistical pooling
  • neural methods

Short-range Action Recognition

  • with shallow motion features
  • too computationally expensive
  • deep appearance features
  • frame-level, 2D
  • complement
  • evolve from 2D to 3D

Long-range Action Recognition

  • temporal pattern
  • learn video-wide representation
  • learns relations between several video segments
  • learns temporal structure
  • different temporal resolutions
  • self-attention

Convolution Decomposition

  • channel shuffling
  • to solve computational complexity
  • separable 2D convolution
  • separable 2+1D convolutions (sketched below)
  • 1x1 2D convolution
  • 3x3 2D spatial convolution
  • models cross-channel correlation
  • grouped convolutions
  • multi-scale 2D spatial kernels
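
As a quick reminder of what these decompositions look like in practice, here is a minimal PyTorch sketch of a (2+1)D separable convolution; this is my own illustration, not code from any of the cited works, and the channel counts and shapes are arbitrary:

```python
import torch
import torch.nn as nn

# A 3x3x3 spatiotemporal convolution replaced by a spatial-only 1x3x3
# convolution followed by a temporal-only 3x1x1 convolution.
separable_2plus1d = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),  # spatial
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),  # temporal
)

x = torch.randn(1, 64, 16, 56, 56)   # (batch, channels, T, H, W)
print(separable_2plus1d(x).shape)    # torch.Size([1, 64, 16, 56, 56])
```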

4. Structures

4.1 inspiration:

The spatiotemporal kernel is decomposable: $w \propto w_s \times w_t$
-> namely $\widetilde{w} = w_{\alpha} \times w_{\beta} \times w_{\gamma} \times \dots$
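
To see why decomposition frees up capacity for the temporal dimension, a rough parameter count helps; the kernel sizes and channel count below are my own illustrative choices, not numbers from the paper:

```latex
\begin{align*}
\text{full 3D kernel:}\quad & k_t \, k^2 \, C_{in} C_{out} = 3 \cdot 3^2 \cdot 1024^2 \approx 28.3\text{M} \\
\text{spatial} \times \text{temporal:}\quad & (k^2 + k_t)\, C_{in} C_{out} = (9 + 3) \cdot 1024^2 \approx 12.6\text{M} \\
\text{depthwise temporal only:}\quad & k_t \, C = 3 \cdot 1024 \quad \text{(one temporal filter per channel)}
\end{align*}
```

The cheaper the temporal kernel, the longer the temporal range the model can afford, which is the point of the further factorization of $\widetilde{w}$ above.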

4.2 three intuitive design principles:

Subspace Modularity
Subspace Balance
Subspace Efficiency

4.3 layer structure
  1. dependency
  2. long-range
  3. temporal extent (the length of a one-action can vary a lot)

The motivation behind this layer design is to gain temporal modeling capacity through decomposition.

  • The left part is the overall layer: a group convolution splits the input channel-wise into N groups while keeping the temporal dimension T unchanged, which reduces the overall complexity.
  • Within each group, the Temporal Conv Module is a depthwise convolution that acts only along the temporal dimension (one temporal filter per channel).
  • After the group outputs are concatenated, a channel-wise shuffle lets the layer learn cross-group correlations (the groups here are channel groups) and further helps reduce complexity (see the sketch below).
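
A minimal PyTorch sketch of that channel-shuffle step; this is my own illustration, not the authors' code, and the (N, C, T, H, W) layout and helper name are assumptions:

```python
import torch

def channel_shuffle(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped operation
    sees channels from every group (cross-group correlation)."""
    n, c, t, h, w = x.size()
    assert c % num_groups == 0
    # (N, G, C/G, T, H, W) -> swap the group and per-group channel axes -> flatten back
    x = x.view(n, num_groups, c // num_groups, t, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, t, h, w)

# e.g. 64 channels split into 8 groups of 8, interleaved back into 64
y = channel_shuffle(torch.randn(2, 64, 32, 7, 7), num_groups=8)
```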

The Temporal Conv Module on the right side:
Its overall layout resembles an Inception network; it is essentially an Inception-style layer applied along the temporal dimension.

  • It has two main stages. The first stage uses multi-scale kernels: convolving at several scales along T is what tolerates the different temporal extents.
  • The second stage seems to further reduce the channel dimension, so that the final channel dimension becomes 5/M of the original (see the sketch below).
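
Below is a sketch of such a per-group module in PyTorch, following my reading of the notes above; the exact branch layout, the kernel sizes (3, 5, 7), and the reduction factor M are assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class TemporalConvModule(nn.Module):
    """Inception-style block along the T axis for one channel group.
    Input/output layout: (N, C, T, H, W); only T is convolved."""

    def __init__(self, channels: int, reduction: int = 4, kernels=(3, 5, 7)):
        super().__init__()
        out_per_branch = channels // reduction      # the "M" reduction from the notes
        self.branches = nn.ModuleList()
        for k in kernels:                           # multi-scale depthwise temporal convs
            self.branches.append(nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                          padding=(k // 2, 0, 0), groups=channels, bias=False),
                nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False),
            ))
        # a temporal max-pool branch and a plain 1x1x1 branch
        self.branches.append(nn.Sequential(
            nn.MaxPool3d(kernel_size=(3, 1, 1), stride=1, padding=(1, 0, 0)),
            nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False),
        ))
        self.branches.append(nn.Conv3d(channels, out_per_branch, kernel_size=1, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # concatenating the 5 branches gives 5/M of the input channels, T unchanged
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

Concatenating the five branches keeps T unchanged while shrinking each group's channels to 5/M of the input, which matches the note above.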

The purpose is roughly twofold:

  1. to learn from long videos, gaining capacity along the T dimension through decomposition
  2. to tolerate varying temporal extents, by using multi-scale kernels

The final model consists of four Timeception layers stacked on top of the last convolution layer of a CNN (see the stacking sketch below).
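
Putting the pieces together, here is a sketch of one Timeception-style layer and a stack of four on top of backbone features, reusing channel_shuffle and TemporalConvModule from the sketches above. The group count, reduction factor, feature shapes, and the temporal max-pooling that halves T per layer are my assumptions / reading, not the official code:

```python
import torch
import torch.nn as nn

class TimeceptionLayer(nn.Module):
    """Channel groups -> per-group TemporalConvModule -> concat -> channel shuffle
    -> temporal max-pool (halves T)."""

    def __init__(self, channels: int, num_groups: int = 8, reduction: int = 4):
        super().__init__()
        assert channels % num_groups == 0
        self.num_groups = num_groups
        self.group_modules = nn.ModuleList(
            TemporalConvModule(channels // num_groups, reduction)
            for _ in range(num_groups))
        self.pool = nn.MaxPool3d(kernel_size=(2, 1, 1), stride=(2, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, T, H, W)
        chunks = torch.chunk(x, self.num_groups, dim=1)
        x = torch.cat([m(c) for m, c in zip(self.group_modules, chunks)], dim=1)
        x = channel_shuffle(x, self.num_groups)
        return self.pool(x)                                # T -> T // 2

# Four layers stacked on the last-layer features of a CNN (shapes illustrative)
feats = torch.randn(1, 1024, 32, 7, 7)                     # (N, C, T, H, W)
layers, c = [], 1024
for _ in range(4):
    layers.append(TimeceptionLayer(c, num_groups=8, reduction=4))
    c = 8 * 5 * (c // 8 // 4)                              # channels grow by ~5/M per layer
print(nn.Sequential(*layers)(feats).shape)
```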

5. Code

github
This part should be studied further together with the experiments section.

6. What’s behind

  • Could I use this neural network / pipeline?
  • Could I learn the way the authors propose an idea?
  • Could I learn the way they solve a problem?
  • Could I learn their writing skills?

Coming up with a task --> finding a solution, plus the writing skill

6.1 Aim

intuitively feasible
I don't think this structure succeeded on the first attempt; there must have been repeated trial and error guided by these inferences.
Those inferences were probably built on earlier papers, and the most important new attempt on top of them was extending similar existing structures to the T dimension.

Goal: gain temporal modeling capacity through decomposition.

  1. to learn from long videos, gaining capacity along the T dimension through decomposition
  2. to tolerate varying temporal extents, by using multi-scale kernels
6.2 Writing style:
  • style: discussing while proposing (conventional method vs. what’s new)

  • structure:

    • Abstract: summarizes 1. focus, 2. downside, 3. innovation, 4. dataset
    • Introduction:
      1. phenomenon to bring “problem”
      2. propose “problems” while discussing and analysing
      3. novelty list
    • Background:
      • related work from different subsections.
      • subsection 1 … subsection 4
      • bring out what you have learned and been inspired by
    • Method:
      • motivation and inspiration
      • structure explanation
    • Experiments
    • Conclusion
6.3 Idea
  1. There are not many mathematical formulas supporting the idea.
    Try to use explanations to express what you are doing more clearly.