1.Aim
- discover the principles for designing effective ConvNet architectures for action recognition in videos
- learn these models given limited training samples
2.Contribution
- TSN
- based on the idea of long-range temporal structure modeling
- combines a sparse temporal sampling strategy with video-level supervision
- a study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network
3.Action Recognition
- appearance & dynamics
4.Two Major obstacles of ConvNet in Video-based Action Recognition
- long-range temporal structure
- understand the dynamics in action video
- training deep ConvNets in practice requires a large volume of training samples
- limited dataset [size & diversity]
5.Study Two Problems
- how to design an effective and efficient video-level framework for learning video representation that is able to capture long-range temporal structure
- how to learn the ConvNet models given limited training samples
6.Action Recognition
- Convolutional network
- directly operate on a longer continuous video stream
- limited by computational cost
- these methods usually process sequences of fixed lengths ranging from 64 to 120 frames
- limited temporal coverage
- unable to assemble an end-to-end learning scheme for modeling the temporal structure
7.Temporal Segment Networks
- a video-level framework that enables modeling dynamics throughout the whole video
- aims to utilize the visual information of the entire video to perform video-level prediction
- composed of spatial stream ConvNets and temporal stream ConvNets
- operates on a sequence of short snippets sparsely sampled from the entire video. Each snippet in this sequence produces its own preliminary prediction of the action classes; then a consensus among the snippets is derived as the video-level prediction
- in the learning process, the loss values of video-level predictions, rather than those of snippet-level predictions used in two-stream ConvNets, are optimized by iteratively updating the model parameters
- TSN(T1, T2, · · · , TK) = H(G(F(T1;W), F(T2;W), · · · , F(TK;W)))
- (T1, T2, · · · , TK) is a sequence of snippets.
- each snippet TK is randomly sampled from its corresponding segment SK.
- F(TK;W) is the function representing a ConvNet with parameters W
- operates on the short snippet TK
- produces class scores for all the classes
- the segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them
- even averaging is used for G
- the prediction function H predicts the probability of each action class for the whole video.
- here we choose the widely used Softmax function for H.
- K is set to 3
- the final loss is the cross-entropy with respect to the segmental consensus G = G(F(T1;W), · · · , F(TK;W)): L(y, G) = − Σi yi (Gi − log Σj exp Gj), with the sums running over the C classes
- C is the number of action classes
- yi is the groundtruth label concerning class i
- back-propagation process
- by fixing K for all videos, we assemble a sparse temporal sampling strategy (see the sketch below)
- reduces the computational cost of evaluating ConvNets on the frames
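A minimal PyTorch-style sketch of the segmental architecture described above, assuming a generic `backbone` ConvNet that returns per-snippet class scores; the names `sample_snippet_indices` and `TSNSketch` are illustrative, not taken from the authors' code.

```python
import random

import torch
import torch.nn as nn


def sample_snippet_indices(num_frames: int, num_segments: int = 3):
    """Sparse sampling: split the video into K equal-length segments and draw
    one random frame index (one snippet) from each segment."""
    seg_len = max(num_frames // num_segments, 1)
    return [k * seg_len + random.randrange(seg_len) for k in range(num_segments)]


class TSNSketch(nn.Module):
    """TSN(T1..TK) = H(G(F(T1;W), ..., F(TK;W))) with G = even averaging."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone          # shared ConvNet F(.; W)
        self.num_segments = num_segments  # K, set to 3 in the notes

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, K, C, H, W), one randomly sampled snippet per segment
        b, k, c, h, w = snippets.shape
        scores = self.backbone(snippets.view(b * k, c, h, w))  # F: per-snippet scores
        consensus = scores.view(b, k, -1).mean(dim=1)          # G: segmental consensus
        return consensus


# Video-level training loss; H (Softmax) is folded into the cross-entropy:
# loss = nn.functional.cross_entropy(model(snippets), labels)
```

Applying the cross-entropy to the consensus scores, rather than to each snippet separately, is what makes the loss video-level, so back-propagation updates W with respect to the whole video.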
8.Learning Temporal Segment Network
- Network Architecture
- Inception with Batch Normalization (BN-Inception)
- due to its good balance between accuracy and efficiency
- the spatial stream ConvNet operates on a single RGB image
- the temporal stream ConvNet takes a stack of consecutive optical flow fields as input
- Network Inputs
- the two-stream ConvNets used RGB images for the spatial stream & stacked optical flow fields for the temporal stream
- RGB difference & warped optical flow fields
- the RGB difference between two consecutive frames describes the appearance change (see the sketch below)
- extract the warped optical flow by first estimating the homography matrix and then compensating for camera motion
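As a concrete illustration of the RGB-difference modality, a short sketch assuming frames are already decoded into a NumPy array; the function name is hypothetical.

```python
import numpy as np


def rgb_difference_stack(frames: np.ndarray) -> np.ndarray:
    """Stack RGB differences between consecutive frames as a cheap motion cue.

    frames: (T, H, W, 3) uint8 clip; returns (T-1, H, W, 3) float32 differences,
    which take the place of stacked optical flow fields as temporal-stream input.
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]
```

The warped-flow variant instead estimates a homography between frames and removes the camera motion before computing flow; that step is omitted here.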
- Network Training
- Cross Modality Pre-training
- we utilize RGB models to initialize the temporal networks
- discretize optical flow fields into the interval from 0 to 255 by a linear transformation
- makes the range of optical flow fields the same as that of RGB images
- modify the weights of the first convolution layer of the RGB models to handle the input of optical flow fields
- average the weights across the RGB channels and replicate this average by the channel number of the temporal network input (see the sketch below)
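A sketch of the first-layer weight adaptation used for cross modality pre-training, assuming a PyTorch `nn.Conv2d` first layer; the helper name is illustrative.

```python
import torch.nn as nn


def adapt_first_conv_for_flow(rgb_conv: nn.Conv2d, flow_channels: int) -> nn.Conv2d:
    """Build a first conv layer for the temporal stream from a pretrained RGB one:
    average its weights across the 3 RGB input channels and replicate that average
    to match the number of optical-flow input channels (e.g. 2L for L flow fields)."""
    w = rgb_conv.weight.data                        # (out_ch, 3, kH, kW)
    mean_w = w.mean(dim=1, keepdim=True)            # (out_ch, 1, kH, kW)
    new_w = mean_w.repeat(1, flow_channels, 1, 1)   # (out_ch, flow_channels, kH, kW)

    flow_conv = nn.Conv2d(
        flow_channels, rgb_conv.out_channels,
        kernel_size=rgb_conv.kernel_size, stride=rgb_conv.stride,
        padding=rgb_conv.padding, bias=rgb_conv.bias is not None,
    )
    flow_conv.weight.data.copy_(new_w)
    if rgb_conv.bias is not None:
        flow_conv.bias.data.copy_(rgb_conv.bias.data)
    return flow_conv
```

The flow fields themselves are first rescaled linearly into [0, 255], so the statistics the RGB model was pretrained on remain roughly valid.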
- Regularization Techniques
- freeze the mean and variance parameters of all Batch Normalization layers except the first one
- add an extra dropout layer after the global pooling layer in the BN-Inception architecture
- the dropout ratio is set to 0.8 for spatial stream ConvNets and 0.7 for temporal stream ConvNets (see the sketch below)
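A sketch of the two regularization techniques, assuming a PyTorch model; freezing the affine parameters of the frozen BN layers is an extra assumption that follows common implementations rather than these notes.

```python
import torch.nn as nn


def partial_bn(model: nn.Module) -> None:
    """Freeze mean/variance of all BatchNorm layers except the first one.

    Must be re-applied after every call to model.train(), since train()
    switches all BN layers back to updating their running statistics.
    """
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for bn in bn_layers[1:]:              # keep the first BN adaptive to new inputs
        bn.eval()                         # stop updating running mean/variance
        bn.weight.requires_grad_(False)   # assumption: also freeze affine params
        bn.bias.requires_grad_(False)


# Extra dropout after global pooling (0.8 spatial / 0.7 temporal), assuming the
# 1024-d pooled feature of BN-Inception and a hypothetical num_classes:
# classifier = nn.Sequential(nn.Dropout(p=0.8), nn.Linear(1024, num_classes))
```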
- Data Augmentation
- In the original two-stream ConvNets, random cropping and horizontal flipping are employed to augment training samples
- corner cropping
- extracted regions are selected only from the corners or the center of the image, to avoid implicitly focusing on the center area of an image
- scale jittering (see the augmentation sketch below)
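A rough sketch of corner cropping plus scale jittering for a single frame using PIL; the scale set and the square-crop simplification are assumptions for illustration, not the exact recipe.

```python
import random

from PIL import Image


def corner_crop_scale_jitter(img: Image.Image, out_size: int = 224,
                             scales=(1.0, 0.875, 0.75, 0.66)) -> Image.Image:
    """Scale jittering + corner cropping for one frame: pick a crop size from a
    small set of scales, place it at one of the four corners or the center,
    resize to the network input size, and randomly flip horizontally."""
    w, h = img.size
    crop = int(min(w, h) * random.choice(scales))

    # Candidate crop origins: four corners and the center only, so the crops
    # are not implicitly concentrated around the image center.
    positions = [(0, 0), (w - crop, 0), (0, h - crop), (w - crop, h - crop),
                 ((w - crop) // 2, (h - crop) // 2)]
    x, y = random.choice(positions)
    patch = img.crop((x, y, x + crop, y + crop)).resize((out_size, out_size))

    if random.random() < 0.5:
        patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
    return patch
```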
9.Testing Temporal Segment Networks
- sample 25 RGB frames or optical flow stacks from the action videos. Meanwhile, we crop 4 corners and 1 center, and their horizontal flipping from the sampled frames to evaluate the ConvNets
- For the fusion of spatial and temporal stream networks, we take a weighted average of them.
- When learned within the temporal segment network framework, the performance gap between spatial stream ConvNets and temporal stream ConvNets is much smaller than that in the original two-stream ConvNets
- the weight of the spatial stream is set to 1 and that of the temporal stream to 1.5
- when both optical flow and warped optical flow are used, the temporal weight of 1.5 is divided into 1 for the optical flow stream and 0.5 for the warped flow stream (see the fusion sketch below)
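A sketch of the weighted score fusion at test time; tensor names are illustrative, and each score tensor is assumed to be already averaged over the 25 sampled frames/stacks and the 10 crops.

```python
from typing import Optional

import torch


def fuse_streams(rgb_scores: torch.Tensor,
                 flow_scores: torch.Tensor,
                 warped_flow_scores: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Weighted average of stream class scores at test time.

    Spatial weight is 1, temporal weight is 1.5; when warped flow is also
    used, the 1.5 is split into 1.0 (optical flow) + 0.5 (warped flow).
    """
    if warped_flow_scores is None:
        return 1.0 * rgb_scores + 1.5 * flow_scores
    return 1.0 * rgb_scores + 1.0 * flow_scores + 0.5 * warped_flow_scores
```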
10.Datasets and Implementation Details
- follow the original evaluation scheme using three training/testing splits and report average accuracy over these splits
- use the mini-batch stochastic gradient descent algorithm to learn the network parameters
- batch size is set to 256
- momentum set to 0.9
- spatial networks
- the learning rate is initialized as 0.001
- decreases to its 1/10 every 2,000 iterations
- the whole training procedure stops at 4,500 iterations
- temporal networks
- the learning rate is initialized as 0.005
- reduces to its 1/10 after 12,000 and 18,000 iterations
- the maximum iteration is set as 20,000 (see the optimizer sketch after this list)
- data augmentation
- location jittering,
- horizontal flipping,
- corner cropping
- scale jittering
- extraction of optical flow and warped optical flow
- choose the TVL1 optical flow algorithm
- speed up training
- employ a data-parallel strategy with multiple GPUs
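A sketch of the training configuration expressed as standard PyTorch SGD with a step learning-rate schedule; the placeholder networks and the 1024-d/101-class sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder heads standing in for the BN-Inception spatial/temporal streams.
spatial_net = nn.Linear(1024, 101)
temporal_net = nn.Linear(1024, 101)

# Spatial stream: lr 0.001, divided by 10 every 2,000 iterations, stop at 4,500.
spatial_opt = torch.optim.SGD(spatial_net.parameters(), lr=0.001, momentum=0.9)
spatial_sched = MultiStepLR(spatial_opt, milestones=[2000, 4000], gamma=0.1)

# Temporal stream: lr 0.005, divided by 10 after 12,000 and 18,000 iterations, stop at 20,000.
temporal_opt = torch.optim.SGD(temporal_net.parameters(), lr=0.005, momentum=0.9)
temporal_sched = MultiStepLR(temporal_opt, milestones=[12000, 18000], gamma=0.1)

# scheduler.step() is called once per mini-batch iteration (batch size 256), not per epoch.
```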
11.Exploration Study
- the optical flow is better at capturing motion information and sometimes RGB difference may be unstable for describing motions
- RGB difference may serve as a low-quality, high-speed alternative for motion representations
12.Evaluation of Temporal Segment Networks
- choose average pooling as the default aggregation function
- BN-Inception as the ConvNet architecture for temporal segment networks
- modeling long-term temporal structures is crucial for better understanding of action in videos
13. Model Visualization
- learned models focus more on humans in the videos, and seem to be modeling the long-range structure of the action class
- models learned with the proposed method may perform better, which is well reflected in our quantitative experiments
14.Conclusion
- TSN
- a video-level framework that aims to model long-term temporal structure
- bring the state of the art to a new level
- segmental architecture with sparse sampling
- a series of good practices that we explored in this work