1.Aim
- discover the principles for designing effective ConvNet architectures for action recognition in videos
- learn these models given limited training samples
2.Contribution
- TSN
- based on the idea of long-range temporal structure modeling
- combines a sparse temporal sampling strategy with video-level supervision
- a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network
3.Action Recognition
- appearance & dynamics
4.Two Major obstacles of ConvNet in Video-based Action Recognition
- long-range temporal structure
- understand the dynamics in action video
- training deep ConvNets in practice requires a large volume of training samples
- limited dataset [size & diversity]
5.Study Two Problems
- how to design an effective and efficient video-level framework for learning video representation that is able to capture long-range temporal structure
- how to learn the ConvNet models given limited training samples
6.Action Recognition
- Convolutional network
- directly operate on a longer continuous video stream
- limited by computational cost
- these methods usually process sequences of fixed lengths ranging from 64 to 120 frames
- limited temporal coverage
- unable to assemble an end-to-end learning scheme for modeling the temporal structure
7.Temporal Segment Networks
- a video-level framework that enables modeling dynamics throughout the whole video
- aims to utilize the visual information of the entire video to perform video-level prediction
- composed of spatial stream ConvNets and temporal stream ConvNets
- operates on a sequence of short snippets sparsely sampled from the entire video. Each snippet in this sequence produces its own preliminary prediction of the action classes; then a consensus among the snippets is derived as the video-level prediction
- in the learning process, the loss values of video-level predictions, rather than those of snippet-level predictions used in two-stream ConvNets, are optimized by iteratively updating the model parameters
- TSN(T1, T2, · · · , TK) = H(G(F(T1;W), F(T2;W), · · · , F(TK;W)))
- (T1, T2, · · · , TK) is a sequence of snippets.
- each snippet TK is randomly sampled from its corresponding segment SK.
- F(TK;W) is the function representing a ConvNet with parameters W
- operates on the short snippet TK
- produces class scores for all the classes
- the segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them
- even averaging is used for G
- the prediction function H predicts the probability of each action class for the whole video.
- here we choose the widely used Softmax function for H.
- K is set to 3
- the final loss is the cross-entropy with respect to the segmental consensus G = G(F(T1;W), · · · , F(TK;W)): L(y, G) = − Σi yi (Gi − log Σj exp Gj), with the sums running over the C classes
- C is the number of action classes
- yi is the groundtruth label concerning class i
- back-propagation process
- by fixing K for all videos, we assemble a sparse temporal sampling strategy (see the sketch below)
- reduces the computational cost of evaluating ConvNets on the frames
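A minimal PyTorch-style sketch of the segmental architecture described above, assuming a generic `backbone` ConvNet that returns per-snippet class scores; the names `sample_snippet_indices` and `TSNSketch` are illustrative, not taken from the authors' code.

```python
import random

import torch
import torch.nn as nn


def sample_snippet_indices(num_frames: int, num_segments: int = 3):
    """Sparse sampling: split the video into K equal-length segments and draw
    one random frame index (one snippet) from each segment."""
    seg_len = max(num_frames // num_segments, 1)
    return [k * seg_len + random.randrange(seg_len) for k in range(num_segments)]


class TSNSketch(nn.Module):
    """TSN(T1..TK) = H(G(F(T1;W), ..., F(TK;W))) with G = even averaging."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone          # shared ConvNet F(.; W)
        self.num_segments = num_segments  # K, set to 3 in the notes

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, K, C, H, W), one randomly sampled snippet per segment
        b, k, c, h, w = snippets.shape
        scores = self.backbone(snippets.view(b * k, c, h, w))  # F: per-snippet scores
        consensus = scores.view(b, k, -1).mean(dim=1)          # G: segmental consensus
        return consensus


# Video-level training loss; H (Softmax) is folded into the cross-entropy:
# loss = nn.functional.cross_entropy(model(snippets), labels)
```

Applying the cross-entropy to the consensus scores, rather than to each snippet separately, is what makes the loss video-level, so back-propagation updates W with respect to the whole video.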
8.Learning Temporal Segment Network
- Network Architecture
- Inception with Batch Normalization (BN-Inception)
- due to its good balance between accuracy and efficiency
- the spatial stream ConvNet operates on a single RGB image
- the temporal stream ConvNet takes a stack of consecutive optical flow fields as input
- Network Inputs
- the two-stream ConvNets used RGB images for the spatial stream & stacked optical flow fields for the temporal stream
- RGB difference & warped optical flow fields
- the RGB difference between two consecutive frames describes the appearance change (see the sketch below)
- extract the warped optical flow by first estimating the homography matrix and then compensating for camera motion
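As a concrete illustration of the RGB-difference modality, a short sketch assuming frames are already decoded into a NumPy array; the function name is hypothetical.

```python
import numpy as np


def rgb_difference_stack(frames: np.ndarray) -> np.ndarray:
    """Stack RGB differences between consecutive frames as a cheap motion cue.

    frames: (T, H, W, 3) uint8 clip; returns (T-1, H, W, 3) float32 differences,
    which take the place of stacked optical flow fields as temporal-stream input.
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]
```

The warped-flow variant instead estimates a homography between frames and removes the camera motion before computing flow; that step is omitted here.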
- Network Training
- Cross Modality Pre-training
- we utilize RGB models to initialize the temporal networks
- discretize optical flow fields into the interval from 0 to 255 by a linear transformation
- makes the range of optical flow fields the same as that of RGB images
- modify the weights of the first convolution layer of the RGB models to handle the input of optical flow fields
- average the weights across the RGB channels and replicate this average by the channel number of the temporal network input (see the sketch below)
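A sketch of the first-layer weight adaptation used for cross modality pre-training, assuming a PyTorch `nn.Conv2d` first layer; the helper name is illustrative.

```python
import torch.nn as nn


def adapt_first_conv_for_flow(rgb_conv: nn.Conv2d, flow_channels: int) -> nn.Conv2d:
    """Build a first conv layer for the temporal stream from a pretrained RGB one:
    average its weights across the 3 RGB input channels and replicate that average
    to match the number of optical-flow input channels (e.g. 2L for L flow fields)."""
    w = rgb_conv.weight.data                        # (out_ch, 3, kH, kW)
    mean_w = w.mean(dim=1, keepdim=True)            # (out_ch, 1, kH, kW)
    new_w = mean_w.repeat(1, flow_channels, 1, 1)   # (out_ch, flow_channels, kH, kW)

    flow_conv = nn.Conv2d(
        flow_channels, rgb_conv.out_channels,
        kernel_size=rgb_conv.kernel_size, stride=rgb_conv.stride,
        padding=rgb_conv.padding, bias=rgb_conv.bias is not None,
    )
    flow_conv.weight.data.copy_(new_w)
    if rgb_conv.bias is not None:
        flow_conv.bias.data.copy_(rgb_conv.bias.data)
    return flow_conv
```

The flow fields themselves are first rescaled linearly into [0, 255], so the statistics the RGB model was pretrained on remain roughly valid.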
- Regularization Techniques
- freeze the mean and variance parameters of all Batch Normalization layers except the first one
- add an extra dropout layer after the global pooling layer in the BN-Inception architecture
- the dropout ratio is set to 0.8 for spatial stream ConvNets and 0.7 for temporal stream ConvNets (see the sketch below)
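A sketch of the two regularization techniques, assuming a PyTorch model; freezing the affine parameters of the frozen BN layers is an extra assumption that follows common implementations rather than these notes.

```python
import torch.nn as nn


def partial_bn(model: nn.Module) -> None:
    """Freeze mean/variance of all BatchNorm layers except the first one.

    Must be re-applied after every call to model.train(), since train()
    switches all BN layers back to updating their running statistics.
    """
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for bn in bn_layers[1:]:              # keep the first BN adaptive to new inputs
        bn.eval()                         # stop updating running mean/variance
        bn.weight.requires_grad_(False)   # assumption: also freeze affine params
        bn.bias.requires_grad_(False)


# Extra dropout after global pooling (0.8 spatial / 0.7 temporal), assuming the
# 1024-d pooled feature of BN-Inception and a hypothetical num_classes:
# classifier = nn.Sequential(nn.Dropout(p=0.8), nn.Linear(1024, num_classes))
```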
- Data Augmentation
- In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples
- corner cropping
- extracted regions are selected only from the corners or the center of the image, to avoid implicitly focusing on the center area of an image
- scale jittering (see the augmentation sketch below)
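A rough sketch of corner cropping plus scale jittering for a single frame using PIL; the scale set and the square-crop simplification are assumptions for illustration, not the exact recipe.

```python
import random

from PIL import Image


def corner_crop_scale_jitter(img: Image.Image, out_size: int = 224,
                             scales=(1.0, 0.875, 0.75, 0.66)) -> Image.Image:
    """Scale jittering + corner cropping for one frame: pick a crop size from a
    small set of scales, place it at one of the four corners or the center,
    resize to the network input size, and randomly flip horizontally."""
    w, h = img.size
    crop = int(min(w, h) * random.choice(scales))

    # Candidate crop origins: four corners and the center only, so the crops
    # are not implicitly concentrated around the image center.
    positions = [(0, 0), (w - crop, 0), (0, h - crop), (w - crop, h - crop),
                 ((w - crop) // 2, (h - crop) // 2)]
    x, y = random.choice(positions)
    patch = img.crop((x, y, x + crop, y + crop)).resize((out_size, out_size))

    if random.random() < 0.5:
        patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
    return patch
```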
9.Testing Temporal Segment Networks
- sample 25 RGB frames or optical flow stacks from the action videos. Meanwhile, we crop 4 corners and 1 center, and their horizontal flipping from the sampled frames to evaluate the ConvNets
- For the fusion of spatial and temporal stream networks, we take a weighted average of them.
- When learned within the temporal segment network framework, the performance gap between spatial stream ConvNets and temporal stream ConvNets is much smaller than that in the original two-stream ConvNets
- the weight of the spatial stream is set to 1 and that of the temporal stream to 1.5
- when both optical flow and warped optical flow are used, the temporal weight of 1.5 is divided into 1 for the optical flow stream and 0.5 for the warped flow stream (see the fusion sketch below)
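A sketch of the weighted score fusion at test time; tensor names are illustrative, and each score tensor is assumed to be already averaged over the 25 sampled frames/stacks and the 10 crops.

```python
from typing import Optional

import torch


def fuse_streams(rgb_scores: torch.Tensor,
                 flow_scores: torch.Tensor,
                 warped_flow_scores: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Weighted average of stream class scores at test time.

    Spatial weight is 1, temporal weight is 1.5; when warped flow is also
    used, the 1.5 is split into 1.0 (optical flow) + 0.5 (warped flow).
    """
    if warped_flow_scores is None:
        return 1.0 * rgb_scores + 1.5 * flow_scores
    return 1.0 * rgb_scores + 1.0 * flow_scores + 0.5 * warped_flow_scores
```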
10.Datasets and Implementation Details
- follow the original evaluation scheme using three training/testing splits and report average accuracy over these splits
- use the mini-batch stochastic gradient descent algorithm to learn the network parameters
- batch size is set to 256
- momentum set to 0.9
- spatial networks
- the learning rate is initialized as 0.001
- decreases to its 1/10 every 2,000 iterations
- the whole training procedure stops at 4,500 iterations
- temporal networks
- the learning rate is initialized as 0.005
- reduces to its 1/10 after 12,000 and 18,000 iterations
- the maximum iteration is set as 20,000 (see the optimizer sketch after this list)
- data augmentation
- location jittering,
- horizontal flipping,
- corner cropping
- scale jittering
- extraction of optical flow and warped optical flow
- choose the TVL1 optical flow algorithm
- speed up training
- employ a data-parallel strategy with multiple GPUs
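A sketch of the training configuration expressed as standard PyTorch SGD with a step learning-rate schedule; the placeholder networks and the 1024-d/101-class sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder heads standing in for the BN-Inception spatial/temporal streams.
spatial_net = nn.Linear(1024, 101)
temporal_net = nn.Linear(1024, 101)

# Spatial stream: lr 0.001, divided by 10 every 2,000 iterations, stop at 4,500.
spatial_opt = torch.optim.SGD(spatial_net.parameters(), lr=0.001, momentum=0.9)
spatial_sched = MultiStepLR(spatial_opt, milestones=[2000, 4000], gamma=0.1)

# Temporal stream: lr 0.005, divided by 10 after 12,000 and 18,000 iterations, stop at 20,000.
temporal_opt = torch.optim.SGD(temporal_net.parameters(), lr=0.005, momentum=0.9)
temporal_sched = MultiStepLR(temporal_opt, milestones=[12000, 18000], gamma=0.1)

# scheduler.step() is called once per mini-batch iteration (batch size 256), not per epoch.
```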
11.Exploration Study
- the optical flow is better at capturing motion information and sometimes RGB difference may be unstable for describing motions
- RGB difference may serve as a low-quality, high-speed alternative for motion representations
12.Evaluation of Temporal Segment Networks
- choose average pooling as the default aggregation function
- BN-Inception as the ConvNet architecture for temporal segment networks
- modeling long-term temporal structures is crucial for better understanding of action in videos
13. Model Visualization
- learned models focus more on humans in the videos, and seem to be modeling the long-range structure of the action class
- models learned with the proposed method may perform better, which is well reflected in our quantitative experiments
14.Conclusion
- TSN
- a video-level framework that aims to model long-term temporal structure
- bring the state of the art to a new level
- segmental architecture with sparse sampling
- a series of good practices that we explored in this work