Action Recognition Paper Notes | SlowFast | SlowFast Networks for Video Recognition

Feichtenhofer, Christoph, et al. “SlowFast Networks for Video Recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. (FAIR)

Motivations

The categorical spatial semantics of the visual content often evolve slowly.

The motion being performed can evolve much faster than the subject identities.

Biological studies on the retinal ganglion cells in the primate visual system: ∼80% are Parvocellular (P-cells) and ∼15-20% are Magnocellular (M-cells).

Solutions

  • Slow pathway: samples the input frames with a large temporal stride (16), i.e., roughly 2 frames per second for a 30-fps video.

  • Fast pathway: samples with a small temporal stride (16/8 = 2) and has no temporal downsampling layers, so its features also keep a high temporal resolution (8×T frames, i.e., αT with α = 8); a sampling sketch is given right after the code below.

    [Figure: the SlowFast two-pathway architecture]

    Why 1/8? (β = 1/8 is the Fast pathway's channel ratio.) The Fast pathway typically takes ∼20% of the total computation. Interestingly, as mentioned in Sec. 1 of the paper, evidence suggests that ∼15-20% of the retinal cells in the primate visual system are M-cells (which are sensitive to fast motion but not to color or spatial detail).

  • Lateral connections: placed right after pool1, res2, res3, and res4; they fuse the Fast pathway's features into the Slow pathway (a shape check is sketched right after the code below).

    • Slow: $\{T, S^2, C\}$
    • Fast: $\{\alpha T, S^2, \beta C\}$ --(3D conv with temporal stride $\alpha$)--> $\{T, S^2, 2\beta C\}$
# Excerpt of the Fast pathway forward pass from a third-party PyTorch implementation
# (repo linked below); `import torch.nn as nn` and the `self.fast_*` modules are
# defined elsewhere in the class.
def FastPath(self, input):
    lateral = []
    x = self.fast_conv1(input)
    x = self.fast_bn1(x)
    x = self.fast_relu(x)
    pool1 = self.fast_maxpool(x)
    # A lateral feature is tapped right after pool1, res2, res3, and res4.
    lateral_p = self.lateral_p1(pool1)
    lateral.append(lateral_p)

    res2 = self.fast_res2(pool1)
    lateral_res2 = self.lateral_res2(res2)
    lateral.append(lateral_res2)

    res3 = self.fast_res3(res2)
    lateral_res3 = self.lateral_res3(res3)
    lateral.append(lateral_res3)

    res4 = self.fast_res4(res3)
    lateral_res4 = self.lateral_res4(res4)
    lateral.append(lateral_res4)

    res5 = self.fast_res5(res4)
    x = nn.AdaptiveAvgPool3d(1)(res5)   # global average pooling over (T, H, W)
    x = x.view(-1, x.size(1))

    # The lateral features are later concatenated into the Slow pathway.
    return x, lateral

# Lateral connections: time-strided 3D convolutions (kernel 5x1x1, temporal stride 8 = alpha)
# that map Fast features {alpha*T, S^2, beta*C} to {T, S^2, 2*beta*C}:
self.lateral_p1 = nn.Conv3d(8, 8*2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
self.lateral_res2 = nn.Conv3d(32, 32*2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
self.lateral_res3 = nn.Conv3d(64, 64*2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
self.lateral_res4 = nn.Conv3d(128, 128*2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))

# https://github.com/r1ch88/SlowFastNetworks/blob/master/lib/slowfastnet.py
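A minimal sketch (not the paper's official code) of how the two inputs can be sampled from one clip with the strides above, assuming PyTorch tensors laid out as (batch, channels, frames, height, width); the tensor names are illustrative. A 64-frame clip gives the Slow pathway T = 4 frames (stride 16) and the Fast pathway αT = 32 frames (stride 2, α = 8), matching the SlowFast 4×16 setting.

import torch

clip = torch.randn(1, 3, 64, 224, 224)   # (batch, channels, frames, height, width)
slow_input = clip[:, :, ::16]            # every 16th frame -> (1, 3, 4, 224, 224)
fast_input = clip[:, :, ::2]             # every 2nd frame  -> (1, 3, 32, 224, 224)
print(slow_input.shape, fast_input.shape)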
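A shape check for a single lateral connection, written as a hedged sketch with illustrative channel sizes (C = 64 here; the real values depend on the stage): a time-strided 3D convolution with kernel 5×1×1 and temporal stride α = 8 maps a Fast feature {αT, S², βC} to {T, S², 2βC}, which is then concatenated onto the matching Slow feature along the channel dimension, as the Slow path of the repo above does.

import torch
import torch.nn as nn

T, S, C = 4, 56, 64                                          # Slow feature: {T, S^2, C}
alpha, beta = 8, 1 / 8
slow_feat = torch.randn(1, C, T, S, S)                       # (1, 64, 4, 56, 56)
fast_feat = torch.randn(1, int(beta * C), alpha * T, S, S)   # (1, 8, 32, 56, 56)

# Time-strided conv: {alpha*T, S^2, beta*C} -> {T, S^2, 2*beta*C}
lateral_conv = nn.Conv3d(int(beta * C), int(2 * beta * C),
                         kernel_size=(5, 1, 1), stride=(alpha, 1, 1),
                         padding=(2, 0, 0), bias=False)
lateral = lateral_conv(fast_feat)                            # (1, 16, 4, 56, 56)

# Fuse into the Slow pathway by channel-wise concatenation
fused = torch.cat([slow_feat, lateral], dim=1)               # (1, 80, 4, 56, 56)
print(lateral.shape, fused.shape)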


Experiments

  1. Kinetics-400 and Kinetics-600: SOTA (see the pretrained-inference sketch at the end of this section).

[Table: Kinetics-400 results] The first entry, SlowFast 4×16 with an R-50 backbone, is already quite lightweight.

  2. Charades: SOTA.


  3. The Fast pathway improves the Slow pathway's performance at a small cost.

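As a usage-level illustration (not from the paper), a pretrained SlowFast model can be pulled from PyTorch Hub; the sketch below assumes the facebookresearch/pytorchvideo hub entry 'slowfast_r50' (a Kinetics-400 checkpoint whose Slow pathway subsamples the 32-frame Fast input by a factor of 4) and feeds random tensors in place of a properly resized and normalized clip.

import torch

model = torch.hub.load('facebookresearch/pytorchvideo', 'slowfast_r50', pretrained=True)
model = model.eval()

fast = torch.randn(1, 3, 32, 256, 256)   # (batch, channels, frames, height, width)
slow = fast[:, :, ::4]                   # the Slow pathway sees every 4th frame (8 frames)

with torch.no_grad():
    logits = model([slow, fast])         # multi-pathway input: a list [slow, fast]
print(logits.shape)                      # expected: (1, 400) Kinetics-400 class scores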

English Expression

  1. Motion is the spatiotemporal counterpart of orientation [2], but all spatiotemporal orientations are not equally likely.

  2. For example, if we see a moving edge in isolation, we perceive it as moving perpendicular to itself, even though in principle it could also have an arbitrary component of movement tangential to itself (the aperture problem in optical flow). In other words, an edge observed in isolation appears to move perpendicular to itself, although in principle it could move in any direction relative to itself.

    The motion direction cannot be estimated by a single local operator (one that measures local pixel-value changes, e.g., a gradient), because each operator only sees the changes inside the small region it covers.


    Animated illustration: http://elvers.us/perception/aperture/

  3. Slow motions are more likely than fast motions (indeed most of the world we see is at rest at a given moment), and this has been exploited in Bayesian accounts of how humans perceive motion stimuli [58]. (“More likely” here means a tendency that can actually be observed, i.e., an empirical regularity; [58] is a 2002 Nature paper, and it is impressive to see it cited here.)

  4. For concreteness, let us study this in the context of recognition.

  5. fast refreshing frames (high temporal resolution): a high frame rate, i.e., temporally dense sampling.

Advantages and Drawbacks

  1. The English writing is excellent, the analogy to biology is apt, and the method is simple and direct: "simplicity at its best."
  2. The overall computational cost is still fairly high; the Fast pathway itself is cheap and complements the Slow pathway. The Slow branch is a large network that thoroughly mines spatiotemporal features (spatial features contribute the most; even the Slow-only model, with most frames dropped, still performs well) and accounts for most of the accuracy, while the Fast branch is a small network that does not need to mine spatial detail: it watches high-frame-rate motion and mainly extracts temporal features. The conclusion I draw is that a lot of temporal information remains untapped, since video is highly redundant along the time axis. P.S. Could a sparse temporal representation be embedded end to end? Can a temporal shift operation be regarded as a form of sparse representation? Could compressed sensing or quantization be applied during data preprocessing?
  3. The ablation studies show that the Fast pathway is insensitive to spatial detail and raises the Slow pathway's accuracy at a small cost.
  4. The paper leaves plenty of room for follow-up work. Nice.
  5. My favorite paper recently: within an all-3D-convolution design, careful architecture choices greatly reduce the overall complexity. The Slow pathway is heavyweight (wide channels, temporally short-range), and the Fast pathway is lightweight (narrow channels, temporally long-range).