1.Contribution
- propose a two-stream ConvNet architecture
- spatial & temporal
- a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance
- multi-task learning, applied to two different action classification datasets
- can increase the amount of training data
- can improve the performance on both
2.Two-Stream
- spatial stream
- action recognition from still video frames
- temporal stream
- recognize action from motion in the form of dense optical flow
- based on the two-pathway hypothesis of human vision
- ventral stream
- performs object recognition
- dorsal stream
- recognize motion
3.Video
- spatial
- in the form of individual frame appearance
- carry information about scenes and objects depicted in the video
- temporal
- in the form of motion across the frames
- conveys the movement of the observer (the camera) and the objects
4.Spatial Stream ConvNet
- operates on individual video frames
- effectively performing action recognition from still images
- some actions are strongly associated with particular objects
- an image classification architecture
5.Optical Flow ConvNets
- input
- formed by stacking optical flow displacement fields between several consecutive frames
- explicitly describes the motion between video frames
- makes the recognition easier
- the network does not need to estimate motion implicitly
6.Mean Flow Subtraction
- from each displacement field d we subtract its mean vector, to compensate for global (camera) motion between the frames
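- a minimal numpy sketch of this step (the (H, W, 2) field layout is an assumption of the example, not from the notes):
```python
import numpy as np

def mean_flow_subtraction(d):
    """Subtract the mean displacement vector from a flow field.

    d: flow field of shape (H, W, 2), channels = (horizontal, vertical).
    Removing the per-field mean is a rough compensation for global
    (camera) motion between the two frames.
    """
    return d - d.reshape(-1, 2).mean(axis=0)
```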
7.Architecture
- sample a 224×224×2L sub-volume from I and pass it to the net as input
- hidden layer configuration is largely the same as in the spatial net
- testing is similar to the spatial ConvNet
8.Optical Flow Stacking
- a dense optical flow can be seen as a set of displacement vector fields d_t between pairs of consecutive frames t and t+1
- d_t(u,v)
- the displacement vector at the point (u,v) in frame t, which moves the point to the corresponding point in the following frame t+1
- d_t^x & d_t^y
- the horizontal and vertical components of the vector field
- well suited to recognition using a convolutional network
- w,h
- the width and height of the video frames
- I_T(u,v,2k-1) = d^x_{T+k-1}(u,v)
  I_T(u,v,2k) = d^y_{T+k-1}(u,v),   u=[1;w], v=[1;h], k=[1;L]
- a ConvNet input volume I_T ∈ R^(w×h×2L) for an arbitrary frame T
- 2L is the number of input channels
- the channels I_T(u,v,c) store the displacement vectors at the location (u,v)
- for an arbitrary point (u,v), the channels I_T(u,v,c), c=[1;2L], encode the motion at that point over a sequence of L frames
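- a minimal numpy sketch of building I_T from L precomputed flow fields (the per-field (H, W, 2) layout and channel order are assumptions of the example):
```python
import numpy as np

def stack_flows(flows):
    """Optical flow stacking (method 1).

    flows: list of L flow fields, each of shape (H, W, 2), where field k
    (0-based) is d_{T+k} between frames T+k and T+k+1, channel 0 = d^x,
    channel 1 = d^y.
    Returns I_T of shape (H, W, 2L): channels (2k, 2k+1) hold the
    horizontal/vertical displacements of the k-th field at each (u, v).
    """
    H, W, _ = flows[0].shape
    I = np.empty((H, W, 2 * len(flows)), dtype=np.float32)
    for k, d in enumerate(flows):
        I[:, :, 2 * k] = d[:, :, 0]      # d^x of the k-th field
        I[:, :, 2 * k + 1] = d[:, :, 1]  # d^y of the k-th field
    return I
```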
9.Trajectory Stacking
- sample along the motion trajectory
- I_T(u,v,2k-1) = d^x_{T+k-1}(p_k)
  I_T(u,v,2k) = d^y_{T+k-1}(p_k),   u=[1;w], v=[1;h], k=[1;L]
- the input volume I_T corresponds to a frame T
- p_k is the k-th point along the trajectory
- starts at the location (u,v) in the frame T
- defined by the following recurrence relation
- p_1 = (u,v)
  p_k = p_{k-1} + d_{T+k-2}(p_{k-1}),   k > 1
- I_T stores the vectors sampled at the locations p_k along the trajectory
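- a sketch of the trajectory variant under the same assumed flow layout; rounding displacements to the nearest pixel is a simplification of this example:
```python
import numpy as np

def stack_flows_along_trajectories(flows):
    """Trajectory stacking (method 2): sample the k-th flow field at the
    point p_k reached by following the motion, instead of at the fixed
    location (u, v).

    flows: list of L flow fields, each (H, W, 2) with channels (d^x, d^y).
    """
    H, W, _ = flows[0].shape
    I = np.empty((H, W, 2 * len(flows)), dtype=np.float32)
    # current trajectory points p_k for every starting location (u, v)
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for k, d in enumerate(flows):
        dx, dy = d[v, u, 0], d[v, u, 1]   # displacement sampled at p_k
        I[:, :, 2 * k] = dx
        I[:, :, 2 * k + 1] = dy
        # recurrence p_{k+1} = p_k + d(p_k), kept inside the frame bounds
        u = np.clip(u + np.rint(dx).astype(int), 0, W - 1)
        v = np.clip(v + np.rint(dy).astype(int), 0, H - 1)
    return I
```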
10.Bi-directional Optical Flow
- compute an additional set of displacement fields in the opposite direction
- construct an input volume I_T by stacking L/2 forward flows between frames T and T+L/2 and L/2 backward flows between frames T-L/2 and T
- the flow can be represented using either of the methods (1) and (2)
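- a sketch of assembling the bi-directional volume, reusing stack_flows from the earlier sketch (plain optical flow stacking is used for each half here):
```python
import numpy as np

def stack_bidirectional(forward_flows, backward_flows):
    """Concatenate L/2 forward flows (frames T .. T+L/2) and L/2
    backward flows (frames T .. T-L/2) into one (H, W, 2L) volume.
    Either stacking method (1) or (2) could be used for each half.
    """
    return np.concatenate(
        [stack_flows(forward_flows), stack_flows(backward_flows)], axis=2)
```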
11.Relation of The Temporal ConvNet Architecture to Previous Representations
- motion is explicitly represented using the optical flow displacement field, computed based on the assumptions of constancy of the intensity and smoothness of the flow
12.Visualisation of Learnt Convolutional Filters
- first-layer convolutional filters learnt on 10 stacked optical flows
- the visualisation is split into 96 columns and 20 rows
- each column corresponds to a filter
- each row corresponds to an input channel
- each of the 96 filters has a spatial receptive field of 7×7 pixels,and spans components of 10 stacked optical flow displacement fields d
- some filters compute spatial derivatives of the optical flow
- capture how motion changes with image location
- generalise derivative-based hand-crafted descriptors
- e.g. MBH
- other filters compute temporal derivatives
- capture changes in motion over time
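- an illustrative numpy sketch of the two kinds of quantities these filters appear to compute, applied directly to a stacked-flow input I of shape (H, W, 2L); this is an interpretation aid, not part of the paper's pipeline:
```python
import numpy as np

def flow_derivatives(I):
    """Spatial and temporal derivatives of a stacked-flow volume.

    Spatial derivatives of the flow components are what MBH-style
    hand-crafted descriptors are built from; temporal derivatives are
    differences between the same component of consecutive flow fields
    (channels c and c+2 in the interleaved x/y layout).
    """
    d_dv, d_du = np.gradient(I, axis=(0, 1))  # per-channel spatial derivatives
    d_dt = I[:, :, 2:] - I[:, :, :-2]         # per-component temporal derivatives
    return d_du, d_dv, d_dt
```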
13.Multi-task Learning
- combine several dataset
- aim to learn a (video) representation that is not only applicable to the task in question (e.g. HMDB-51 classification), but also to other tasks (e.g. UCF-101 classification)
- additional tasks act as a regulariser and allow for the exploitation of additional training data
- in our case, the ConvNet architecture has two softmax classification layers on top of the last fully-connected layer (sketched below)
- one computes HMDB-51 classification scores
- the other computes UCF-101 scores
- each of the layers is equipped with its own loss function
- the overall training loss is computed as the sum of the individual tasks’ losses
- the network weight derivatives can be found by back-propagation
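- a PyTorch sketch of the two-head setup; the trunk and the 4096-d feature dimension are placeholders, and the paper's actual implementation is Caffe-based:
```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    """Shared trunk with two softmax classification heads on top of the
    last fully-connected layer: one for HMDB-51, one for UCF-101."""
    def __init__(self, trunk, feat_dim=4096):
        super().__init__()
        self.trunk = trunk                       # hypothetical feature extractor
        self.head_hmdb = nn.Linear(feat_dim, 51)
        self.head_ucf = nn.Linear(feat_dim, 101)

    def forward(self, x):
        f = self.trunk(x)
        return self.head_hmdb(f), self.head_ucf(f)

def multitask_loss(logits_hmdb, logits_ucf, labels, is_hmdb):
    """Overall loss = sum of the per-task losses; a sample contributes
    only to the loss of the head matching its source dataset."""
    loss = logits_hmdb.new_zeros(())
    if is_hmdb.any():
        loss = loss + F.cross_entropy(logits_hmdb[is_hmdb], labels[is_hmdb])
    if (~is_hmdb).any():
        loss = loss + F.cross_entropy(logits_ucf[~is_hmdb], labels[~is_hmdb])
    return loss  # standard back-propagation handles both heads
```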
14.Implementation details
- ConvNets configuration
- all hidden weight layers use the rectification (ReLU) activation function
- max pooling is performed over 3×3 spatial windows with stride 2
- local response normalisation uses the same setting as 《ImageNet Classification with Deep Convolutional Neural Networks》
- difference between the spatial and temporal ConvNet configurations
- the second normalisation layer is removed from the latter to reduce memory consumption
- training
- spatial net training
- a 224×224 sub-image is randomly cropped from the selected frame, then undergoes random horizontal flipping and RGB jittering
- videos are rescaled beforehand
- the sub-image is sampled from the whole frame
- temporal net training
- compute an optical flow volume I for the selected training frame; from I, a fixed-size 224×224×2L input is randomly cropped and flipped
- learning rate
- initially set to 10^-2, then decreased according to a fixed schedule, which is kept the same for all training sets
- changed to 10^-3 after 50k iterations; training stops after 80k iterations
- in fine-tuning, the rate is changed to 10^-3 after 14k iterations and training stops after 20k iterations
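- the schedule above as a small helper (that fine-tuning also starts at 10^-2 is inferred, not stated in the notes):
```python
def learning_rate(iteration, fine_tuning=False):
    """Fixed learning-rate schedule, kept the same for all training sets."""
    drop_at = 14_000 if fine_tuning else 50_000   # when the rate drops to 1e-3
    return 1e-2 if iteration < drop_at else 1e-3
```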
- testing
- sample a fixed number of frames (25) with equal temporal spacing between them
- get 10 ConvNet inputs from each of the frames by cropping and flipping the four corners and the center of the frame
- class scores for the whole video are then obtained by averaging the scores across the sampled frames and crops therein
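- a numpy sketch of this test-time protocol; predict stands for a forward pass of a trained net and is hypothetical:
```python
import numpy as np

def ten_crops(frame, size=224):
    """Four corners + center of the frame, plus their horizontal flips."""
    H, W = frame.shape[:2]
    offsets = [(0, 0), (0, W - size), (H - size, 0),
               (H - size, W - size), ((H - size) // 2, (W - size) // 2)]
    crops = [frame[t:t + size, l:l + size] for t, l in offsets]
    return np.stack(crops + [c[:, ::-1] for c in crops])  # 10 crops

def video_scores(frames, predict, num_frames=25):
    """Average class scores over 25 equally spaced frames x 10 crops."""
    idx = np.linspace(0, len(frames) - 1, num=num_frames).astype(int)
    return np.mean([predict(ten_crops(frames[i])).mean(axis=0) for i in idx],
                   axis=0)
```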
- pre-training on ImageNet ILSVRC-2012
- pre-train the spatial ConvNet
- use the same training and test data augmentation (cropping, flipping, RGB jittering)
- sample from the whole image
- Multi-GPU training
- derived from Caffe, with many modifications, including parallel training on multiple GPUs installed in a single system
- exploits data parallelism and splits each SGD batch across several GPUs
- 3.2 times speed up
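- in a modern framework the same data-parallel scheme is a one-liner; a PyTorch stand-in (the stand-in network is hypothetical; the paper's code is a modified Caffe):
```python
import torch
import torch.nn as nn

convnet = nn.Sequential(                  # hypothetical stand-in network
    nn.Conv2d(20, 96, kernel_size=7, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(96, 101),
)
# Data parallelism: each SGD batch is split across the visible GPUs and
# the replicas' gradients are accumulated before the weight update.
if torch.cuda.device_count() > 1:
    convnet = nn.DataParallel(convnet)
```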
- optical flow
- using the off-the-shelf GPU implementation of 《High accuracy optical flow estimation based on a theory for warping》from the OpenCV toolbox
- the flow is pre-computed before training
- the horizontal and vertical components of the flow are linearly rescaled to a [0,255] range and compressed using JPEG; this avoids storing the displacement fields as floats and reduces the flow size for the UCF-101 dataset from 1.5TB to 27GB
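- a sketch of the rescale-and-compress step with OpenCV; the clipping bound of ±20 pixels is an assumption of this example, not from the notes:
```python
import cv2
import numpy as np

def save_flow_jpeg(d, path_x, path_y, bound=20.0):
    """Linearly rescale each flow component to [0, 255], quantise to
    uint8 and write it as a JPEG, instead of storing float fields."""
    q = np.clip((d + bound) * (255.0 / (2 * bound)), 0, 255).astype(np.uint8)
    cv2.imwrite(path_x, q[:, :, 0])  # horizontal component
    cv2.imwrite(path_y, q[:, :, 1])  # vertical component
```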
15.Evaluation
- datasets and evaluation protocol
- performed on UCF-101 and HMDB-51
- UCF-101 contains 13k videos of 101 actions
- HMDB-51 contains 6.8k videos of 51 actions
- evaluation protocol
- the organisers provide 3 splits into training and testing data
- the performance is measured by the mean classification accuracy across the splits
- UCF-101 contains 9.5k training videos
- HMDB-51 contains 3.7k training videos
- we begin by comparing different architectures on the first split of the UCF-101 dataset
- follow the standard evaluation protocol & report the average accuracy over three splits on both UCF-101 & HMDB-51
- spatial ConvNet
- measure the performance of the spatial stream ConvNet
- choose to train only the last layer on top of a pre-trained ConvNet
- temporal ConvNet
- in particular, measure the effect of
- using multiple (L={5,10}) stacked optical flows
- trajectory stacking
- mean displacement subtraction
- using bi-directional optical flow
- use an aggressive dropout ratio of 0.9 to help improve generalisation
- results
- stacking multiple (L>1) displacement fields in the input is highly beneficial
- it provides the network with long-term motion information
- mean subtraction is helpful
- reduce the effect of global motion between the frames
- the temporal ConvNet significantly outperforms the spatial ConvNet
- confirms the importance of motion information for action recognition
- implement the “slow fusion” architecture of 《Large-scale video classification with convolutional neural networks》
- amounts to applying a ConvNet to a stack of RGB frames
- while multi-frame information is important, it is also important to present it to a ConvNet in an appropriate manner
- multi-task learning of temporal ConvNets
- training the ConvNet on HMDB-51 is different from training it on UCF-101
- multi-task learning performs the best
- it allows the training procedure to exploit all available training data
- two-stream ConvNet
- we evaluate the complete two-stream model
- combines the two recognition streams
- fuse the softmax scores using either averaging or a linear SVM (see the sketch after this list)
- conclude
- temporal and spatial recognition streams are complementary
- their fusion significantly improves on both
- 6% over temporal and 14% over spatial nets
- SVM-based fusion of softmax scores outperforms fusion by averaging
- using Bi-directional flow is not beneficial in the case of ConvNet fusion
- the temporal ConvNet trained using multi-task learning performs the best, both alone and when fused with a spatial net
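- a sketch of the two fusion variants on precomputed softmax scores; scikit-learn's LinearSVC stands in for the paper's linear SVM, and C=1.0 is an assumed default:
```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the two streams' softmax scores per video."""
    return (spatial_scores + temporal_scores) / 2.0

def fuse_by_svm(spatial_tr, temporal_tr, labels_tr, spatial_te, temporal_te):
    """Late fusion: train a linear SVM on the stacked (concatenated)
    softmax scores of the two streams, then classify the test videos."""
    svm = LinearSVC(C=1.0).fit(np.hstack([spatial_tr, temporal_tr]), labels_tr)
    return svm.predict(np.hstack([spatial_te, temporal_te]))
```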
16.Comparison with the State of the Art
- both our spatial and temporal nets alone outperform the deep architectures of 《Large-scale video classification with convolutional neural networks》and 《A large video database for human motion recognition》by a large margin
- the combination of two nets
- further improves the results
- is comparable to very recent state-of-the-art hand-crafted models
- confusion matrix and per-class recall for UCF-101 classification
- the worst class is Hammering, which is confused with the HeadMassage and BrushingTeeth classes
- reason
- the spatial ConvNet confuses Hammering with HeadMassage, which can be caused by the significant presence of human faces in both classes
- the temporal ConvNet confuses Hammering with BrushingTeeth as both actions contain recurring motion patterns
- hand moving up and down
17.Conclusion
- proposed a deep video classification model with competitive performance, which incorporates separate spatial and temporal recognition streams based on ConvNets
- training a temporal ConvNet on optical flow is significantly better than training on raw stacked frames
- our temporal model does not require significant hand-crafting, despite using optical flow as input
- since the flow is computed using a method based on the generic assumptions of constancy and smoothness
- training on extra data poses a significant challenge on its own
- due to the gigantic amount of training data (multiple TBs)
- essential ingredients of the state of the art are missing from our current architecture
- local feature pooling over spatio-temporal tubes, centered at the trajectories
- even though the input (2) captures the optical flow along the trajectories, the spatial pooling in our network does not take the trajectories into account
- explicit handling of camera motion, which in our case is compensated by mean displacement subtraction