1.Contribution
- propose a two-stream ConvNet architecture
- spatial & temporal
- a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance
- multi-task learning, applied to two different action classification datasets
- can increase the amount of training data
- can improve the performance on both
2.Two-Stream
- spatial stream
- action recognition from still video frames
- temporal stream
- recognize action from motion in the form of dense optical flow
- based on the two-pathway hypothesis of human vision
- ventral stream
- performs object recognition
- dorsal stream
- recognize motion
3.Video
- spatial
- in the form of individual frame appearance
- carry information about scenes and objects depicted in the video
- temporal
- in the form of motion across the frames
- conveys the movement of the observer (the camera) and the objects
4.Spatial Stream ConvNet
- operates on individual video frames
- effectively performing action recognition from still images
- some actions are strongly associated with particular objects
- an image classification architecture
5.Optical Flow ConvNets
- input
- formed by stacking optical flow displacement fields between several consecutive frames
- explicitly describes the motion between video frames
- makes the recognition easier
- the network does not need to estimate motion implicitly
6.Mean Flow Subtraction
- from each displacement field d we subtract its mean vector, to compensate for global (camera) motion between the frames
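- a minimal numpy sketch of this step (the (H, W, 2) field layout is an assumption of the example, not from the notes):
```python
import numpy as np

def mean_flow_subtraction(d):
    """Subtract the mean displacement vector from a flow field.

    d: flow field of shape (H, W, 2), channels = (horizontal, vertical).
    Removing the per-field mean is a rough compensation for global
    (camera) motion between the two frames.
    """
    return d - d.reshape(-1, 2).mean(axis=0)
```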
7.Architecture
- sample a 224×224×2L sub-volume from I and pass it to the net as input
- hidden layer configuration is largely the same as in the spatial net
- testing is similar to the spatial ConvNet
8.Optical Flow Stacking
- a dense optical flow can be seen as a set of displacement vector fields d_t between pairs of consecutive frames t and t+1
- d_t(u,v)
- the displacement vector at the point (u,v) in frame t, which moves the point to the corresponding point in the following frame t+1
- d_t^x & d_t^y
- the horizontal and vertical components of the vector field
- well suited to recognition using a convolutional network
- w,h
- the width and height of the video frames
- I_T(u,v,2k-1) = d^x_{T+k-1}(u,v)
  I_T(u,v,2k) = d^y_{T+k-1}(u,v),   u=[1;w], v=[1;h], k=[1;L]
- a ConvNet input volume I_T ∈ R^(w×h×2L) for an arbitrary frame T
- 2L is the number of input channels
- the channels I_T(u,v,c) store the displacement vectors at the location (u,v)
- for an arbitrary point (u,v), the channels I_T(u,v,c), c=[1;2L], encode the motion at that point over a sequence of L frames
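- a minimal numpy sketch of building I_T from L precomputed flow fields (the per-field (H, W, 2) layout and channel order are assumptions of the example):
```python
import numpy as np

def stack_flows(flows):
    """Optical flow stacking (method 1).

    flows: list of L flow fields, each of shape (H, W, 2), where field k
    (0-based) is d_{T+k} between frames T+k and T+k+1, channel 0 = d^x,
    channel 1 = d^y.
    Returns I_T of shape (H, W, 2L): channels (2k, 2k+1) hold the
    horizontal/vertical displacements of the k-th field at each (u, v).
    """
    H, W, _ = flows[0].shape
    I = np.empty((H, W, 2 * len(flows)), dtype=np.float32)
    for k, d in enumerate(flows):
        I[:, :, 2 * k] = d[:, :, 0]      # d^x of the k-th field
        I[:, :, 2 * k + 1] = d[:, :, 1]  # d^y of the k-th field
    return I
```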
9.Trajectory Stacking
- sample along the motion trajectory
- I_T(u,v,2k-1) = d^x_{T+k-1}(p_k)
  I_T(u,v,2k) = d^y_{T+k-1}(p_k),   u=[1;w], v=[1;h], k=[1;L]
- the input volume I_T corresponds to a frame T
- p_k is the k-th point along the trajectory
- starts at the location (u,v) in the frame T
- defined by the following recurrence relation
- p_1 = (u,v)
  p_k = p_{k-1} + d_{T+k-2}(p_{k-1}),   k > 1
- I_T stores the vectors sampled at the locations p_k along the trajectory
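- a sketch of the trajectory variant under the same assumed flow layout; rounding displacements to the nearest pixel is a simplification of this example:
```python
import numpy as np

def stack_flows_along_trajectories(flows):
    """Trajectory stacking (method 2): sample the k-th flow field at the
    point p_k reached by following the motion, instead of at the fixed
    location (u, v).

    flows: list of L flow fields, each (H, W, 2) with channels (d^x, d^y).
    """
    H, W, _ = flows[0].shape
    I = np.empty((H, W, 2 * len(flows)), dtype=np.float32)
    # current trajectory points p_k for every starting location (u, v)
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for k, d in enumerate(flows):
        dx, dy = d[v, u, 0], d[v, u, 1]   # displacement sampled at p_k
        I[:, :, 2 * k] = dx
        I[:, :, 2 * k + 1] = dy
        # recurrence p_{k+1} = p_k + d(p_k), kept inside the frame bounds
        u = np.clip(u + np.rint(dx).astype(int), 0, W - 1)
        v = np.clip(v + np.rint(dy).astype(int), 0, H - 1)
    return I
```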
10.Bi-directional Optical Flow
- compute an additional set of displacement fields in the opposite direction
- construct an input volume I_T by stacking L/2 forward flows between frames T and T+L/2 and L/2 backward flows between frames T-L/2 and T
- the flow can be represented using either of the methods (1) and (2)
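- a sketch of assembling the bi-directional volume, reusing stack_flows from the earlier sketch (plain optical flow stacking is used for each half here):
```python
import numpy as np

def stack_bidirectional(forward_flows, backward_flows):
    """Concatenate L/2 forward flows (frames T .. T+L/2) and L/2
    backward flows (frames T .. T-L/2) into one (H, W, 2L) volume.
    Either stacking method (1) or (2) could be used for each half.
    """
    return np.concatenate(
        [stack_flows(forward_flows), stack_flows(backward_flows)], axis=2)
```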
11.Relation of The Temporal ConvNet Architecture to Previous Representations
- motion is explicitly represented using the optical flow displacement field, computed based on the assumptions of constancy of the intensity and smoothness of the flow
12.Visualisation of Learnt Convolutional Filters
- first-layer convolutional filters learnt on 10 stacked optical flows
- the visualisation is split into 96 columns and 20 rows
- each column corresponds to a filter
- each row corresponds to an input channel
- each of the 96 filters has a spatial receptive field of 7×7 pixels,and spans components of 10 stacked optical flow displacement fields d
- some filters compute spatial derivatives of the optical flow
- capture how motion changes with image location
- generalise derivative-based hand-crafted descriptors
- e.g. MBH
- other filters compute temporal derivatives
- capture changes in motion over time
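- an illustrative numpy sketch of the two kinds of quantities these filters appear to compute, applied directly to a stacked-flow input I of shape (H, W, 2L); this is an interpretation aid, not part of the paper's pipeline:
```python
import numpy as np

def flow_derivatives(I):
    """Spatial and temporal derivatives of a stacked-flow volume.

    Spatial derivatives of the flow components are what MBH-style
    hand-crafted descriptors are built from; temporal derivatives are
    differences between the same component of consecutive flow fields
    (channels c and c+2 in the interleaved x/y layout).
    """
    d_dv, d_du = np.gradient(I, axis=(0, 1))  # per-channel spatial derivatives
    d_dt = I[:, :, 2:] - I[:, :, :-2]         # per-component temporal derivatives
    return d_du, d_dv, d_dt
```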
13.Multi-task Learning
- combine several dataset
- aim to learn a (video) representation that is not only applicable to the task in question (e.g. HMDB-51 classification), but also to other tasks (e.g. UCF-101 classification)
- additional tasks act as a regulariser and allow for the exploitation of additional training data
- in our case, the ConvNet architecture has two softmax classification layers on top of the last fully-connected layer (sketched below)
- one computes HMDB-51 classification scores
- the other computes UCF-101 scores
- each of the layers is equipped with its own loss function
- the overall training loss is computed as the sum of the individual tasks’ losses
- the network weight derivatives can be found by back-propagation
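- a PyTorch sketch of the two-head setup; the trunk and the 4096-d feature dimension are placeholders, and the paper's actual implementation is Caffe-based:
```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    """Shared trunk with two softmax classification heads on top of the
    last fully-connected layer: one for HMDB-51, one for UCF-101."""
    def __init__(self, trunk, feat_dim=4096):
        super().__init__()
        self.trunk = trunk                       # hypothetical feature extractor
        self.head_hmdb = nn.Linear(feat_dim, 51)
        self.head_ucf = nn.Linear(feat_dim, 101)

    def forward(self, x):
        f = self.trunk(x)
        return self.head_hmdb(f), self.head_ucf(f)

def multitask_loss(logits_hmdb, logits_ucf, labels, is_hmdb):
    """Overall loss = sum of the per-task losses; a sample contributes
    only to the loss of the head matching its source dataset."""
    loss = logits_hmdb.new_zeros(())
    if is_hmdb.any():
        loss = loss + F.cross_entropy(logits_hmdb[is_hmdb], labels[is_hmdb])
    if (~is_hmdb).any():
        loss = loss + F.cross_entropy(logits_ucf[~is_hmdb], labels[~is_hmdb])
    return loss  # standard back-propagation handles both heads
```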
14.Implementation details
- ConvNets configuration
- all hidden weight layers use the rectification (ReLU) activation function
- max pooling is performed over 3×3 spatial windows with stride 2
- local response normalisation uses the same setting as 《ImageNet Classification with Deep Convolutional Neural Networks》
- difference between the spatial and temporal ConvNet configurations
- the second normalisation layer is removed from the latter to reduce memory consumption
- training
- spatial net training
- a 224×224 sub-image is randomly cropped from the selected frame, then undergoes random horizontal flipping and RGB jittering
- videos are rescaled beforehand
- the sub-image is sampled from the whole frame
- temporal net training
- compute an optical flow volume I for the selected training frame; from I, a fixed-size 224×224×2L input is randomly cropped and flipped
- learning rate
- initially set to 10^-2, then decreased according to a fixed schedule, which is kept the same for all training sets
- changed to 10^-3 after 50k iterations; training stops after 80k iterations
- in fine-tuning, the rate is changed to 10^-3 after 14k iterations and training stops after 20k iterations
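- the schedule above as a small helper (that fine-tuning also starts at 10^-2 is inferred, not stated in the notes):
```python
def learning_rate(iteration, fine_tuning=False):
    """Fixed learning-rate schedule, kept the same for all training sets."""
    drop_at = 14_000 if fine_tuning else 50_000   # when the rate drops to 1e-3
    return 1e-2 if iteration < drop_at else 1e-3
```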
- testing
- sample a fixed number of frames (25) with equal temporal spacing between them
- get 10 ConvNet inputs from each of the frames by cropping and flipping the four corners and the center of the frame
- class scores for the whole video are then obtained by averaging the scores across the sampled frames and crops therein
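- a numpy sketch of this test-time protocol; predict stands for a forward pass of a trained net and is hypothetical:
```python
import numpy as np

def ten_crops(frame, size=224):
    """Four corners + center of the frame, plus their horizontal flips."""
    H, W = frame.shape[:2]
    offsets = [(0, 0), (0, W - size), (H - size, 0),
               (H - size, W - size), ((H - size) // 2, (W - size) // 2)]
    crops = [frame[t:t + size, l:l + size] for t, l in offsets]
    return np.stack(crops + [c[:, ::-1] for c in crops])  # 10 crops

def video_scores(frames, predict, num_frames=25):
    """Average class scores over 25 equally spaced frames x 10 crops."""
    idx = np.linspace(0, len(frames) - 1, num=num_frames).astype(int)
    return np.mean([predict(ten_crops(frames[i])).mean(axis=0) for i in idx],
                   axis=0)
```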
- pre-training on ImageNet ILSVRC-2012
- pre-train the spatial ConvNet
- use the same training and test data augmentation (cropping, flipping, RGB jittering)
- sample from the whole image
- Multi-GPU training
- derived from Caffe, with many modifications, including parallel training on multiple GPUs installed in a single system
- exploits data parallelism and splits each SGD batch across several GPUs
- 3.2 times speed up
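- in a modern framework the same data-parallel scheme is a one-liner; a PyTorch stand-in (the stand-in network is hypothetical; the paper's code is a modified Caffe):
```python
import torch
import torch.nn as nn

convnet = nn.Sequential(                  # hypothetical stand-in network
    nn.Conv2d(20, 96, kernel_size=7, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(96, 101),
)
# Data parallelism: each SGD batch is split across the visible GPUs and
# the replicas' gradients are accumulated before the weight update.
if torch.cuda.device_count() > 1:
    convnet = nn.DataParallel(convnet)
```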
- optical flow
- using the off-the-shelf GPU implementation of 《High accuracy optical flow estimation based on a theory for warping》from the OpenCV toolbox
- the flow is pre-computed before training
- the horizontal and vertical components of the flow are linearly rescaled to a [0,255] range and compressed using JPEG; this avoids storing the displacement fields as floats and reduces the flow size for the UCF-101 dataset from 1.5TB to 27GB
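- a sketch of the rescale-and-compress step with OpenCV; the clipping bound of ±20 pixels is an assumption of this example, not from the notes:
```python
import cv2
import numpy as np

def save_flow_jpeg(d, path_x, path_y, bound=20.0):
    """Linearly rescale each flow component to [0, 255], quantise to
    uint8 and write it as a JPEG, instead of storing float fields."""
    q = np.clip((d + bound) * (255.0 / (2 * bound)), 0, 255).astype(np.uint8)
    cv2.imwrite(path_x, q[:, :, 0])  # horizontal component
    cv2.imwrite(path_y, q[:, :, 1])  # vertical component
```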
15.Evaluation
- datasets and evaluation protocol
- performed on UCF-101 and HMDB-51
- UCF-101 contains 13k videos of 101 actions
- HMDB-51 contains 6.8k videos of 51 actions
- evaluation protocol
- the organisers provide 3 splits into training and testing data
- the performance is measured by the mean classification accuracy across the splits
- UCF-101 contains 9.5k training videos
- HMDB-51 contains 3.7k training videos
- we begin by comparing different architectures on the first split of the UCF-101 dataset
- follow the standard evaluation protocol & report the average accuracy over three splits on both UCF-101 & HMDB-51
- spatial ConvNet
- measure the performance of the spatial stream ConvNet
- choose to train only the last layer on top of a pre-trained ConvNet
- temporal ConvNet
- in particular, measure the effect of
- using multiple (L={5,10}) stacked optical flows
- trajectory stacking
- mean displacement subtraction
- using bi-directional optical flow
- use an aggressive dropout ratio of 0.9 to help improve generalisation
- results
- stacking multiple (L>1) displacement fields in the input is highly beneficial
- it provides the network with long-term motion information
- mean subtraction is helpful
- reduce the effect of global motion between the frames
- the temporal ConvNet significantly outperforms the spatial ConvNet
- confirms the importance of motion information for action recognition
- implement the “slow fusion” architecture of 《Large-scale video classification with convolutional neural networks》
- amounts to applying a ConvNet to a stack of RGB frames
- while multi-frame information is important, it is also important to present it to a ConvNet in an appropriate manner
- multi-task learning of temporal ConvNets
- training the ConvNet on HMDB-51 is different from training it on UCF-101
- multi-task learning performs the best
- it allows the training procedure to exploit all available training data
- two-stream ConvNet
- we evaluate the complete two-stream model
- combines the two recognition streams
- fuse the softmax scores using either averaging or a linear SVM (see the sketch after this list)
- conclude
- temporal and spatial recognition streams are complementary
- their fusion significantly improves on both
- 6% over temporal and 14% over spatial nets
- SVM-based fusion of softmax scores outperforms fusion by averaging
- using Bi-directional flow is not beneficial in the case of ConvNet fusion
- the temporal ConvNet trained using multi-task learning performs the best, both alone and when fused with a spatial net
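- a sketch of the two fusion variants on precomputed softmax scores; scikit-learn's LinearSVC stands in for the paper's linear SVM, and C=1.0 is an assumed default:
```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the two streams' softmax scores per video."""
    return (spatial_scores + temporal_scores) / 2.0

def fuse_by_svm(spatial_tr, temporal_tr, labels_tr, spatial_te, temporal_te):
    """Late fusion: train a linear SVM on the stacked (concatenated)
    softmax scores of the two streams, then classify the test videos."""
    svm = LinearSVC(C=1.0).fit(np.hstack([spatial_tr, temporal_tr]), labels_tr)
    return svm.predict(np.hstack([spatial_te, temporal_te]))
```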
16.Comparison with the State of the Art
- both our spatial and temporal nets alone outperform the deep architectures of 《Large-scale video classification with convolutional neural networks》and 《A large video database for human motion recognition》by a large margin
- the combination of two nets
- further improves the results
- is comparable to very recent state-of-the-art hand-crafted models
- confusion matrix and per-class recall for UCF-101 classification
- the worst class is Hammering, which is confused with the HeadMassage and BrushingTeeth classes
- reason
- the spatial ConvNet confuses Hammering with HeadMassage, which can be caused by the significant presence of human faces in both classes
- the temporal ConvNet confuses Hammering with BrushingTeeth as both actions contain recurring motion patterns
- hand moving up and down
17.Conclusion
- proposed a deep video classification model with competitive performance, which incorporates separate spatial and temporal recognition streams based on ConvNets
- training a temporal ConvNet on optical flow is significantly better than training on raw stacked frames
- our temporal model does not require significant hand-crafting, despite using optical flow as input
- since the flow is computed using a method based on the generic assumptions of constancy and smoothness
- training on extra data poses a significant challenge on its own
- due to the gigantic amount of training data (multiple TBs)
- essential ingredients of the state of the art are missing from our current architecture
- local feature pooling over spatio-temporal tubes, centered at the trajectories
- even though the input (2) captures the optical flow along the trajectories, the spatial pooling in our network does not take the trajectories into account
- explicit handling of camera motion, which in our case is compensated by mean displacement subtraction