Image dataset: ImageNet
Video dataset: UCF-101
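Hand-crafted feature extraction was the most widely used approach in early video representation learning.
These methods have four clear drawbacks:
- They are sensitive to camera motion and illumination changes.
- They carry no high-level semantic information.
- The feature dimensionality is too high.
- They are computationally expensive.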
Early deep-learning-based video representations were built on 2D convolutional neural networks (2D-CNNs).
Paper1: Large-scale Video Classification with Convolutional Neural Networks
Dataset: The Sports-1M dataset consists of 1 million YouTube videos annotated with 487 classes.
An effective approach to speeding up the runtime performance of CNNs is to modify the architecture to contain two separate streams of processing: a context stream that learns features on low-resolution frames and a high-resolution fovea stream that only operates on the middle portion of the frame.
Red, green and blue boxes indicate convolutional, normalization and pooling layers respectively.
Multiresolution CNN architecture. Input frames are fed into two separate streams of processing: a context stream that models low-resolution image and a fovea stream that processes high-resolution center crop. Both streams consist of alternating convolution (red), normalization (green) and pooling (blue) layers. Both streams converge to two fully connected layers (yellow).
The context stream learns more color features, while the high-resolution fovea stream learns high-frequency grayscale filters.
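The way a single frame is split into the two stream inputs can be sketched as follows. This is a minimal illustration, not the paper's preprocessing code: the function name `multires_inputs`, the 2x2 block-averaging downsampler, and the 128x128 frame size are all assumptions made for the example.

```python
import numpy as np

def multires_inputs(frame):
    """Split one H x W x C frame into (context, fovea) inputs.

    Assumes H and W are even. Both outputs have shape (H/2, W/2, C).
    """
    h, w, c = frame.shape
    # Context stream: the whole frame at half resolution,
    # obtained here by averaging each 2x2 block (an assumed downsampler).
    context = frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    # Fovea stream: the central region at full resolution.
    top, left = h // 4, w // 4
    fovea = frame[top:top + h // 2, left:left + w // 2]
    return context, fovea

# Hypothetical frame size, chosen only for illustration.
frame = np.random.rand(128, 128, 3)
context, fovea = multires_inputs(frame)
print(context.shape, fovea.shape)
```

Both inputs end up with a quarter of the original pixel count, which is where the runtime saving described above comes from.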
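Conclusion:
1. Two separate streams of processing: a context stream that models the low-resolution image and a fovea stream that processes the high-resolution center crop.
2. Slow fusion: temporal information is fused gradually across the layers of the network rather than all at once.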
Paper2: Two-Stream Convolutional Networks for Action Recognition in Videos
The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.
The input to our model is formed by stacking optical flow displacement fields between several consecutive frames. Such input explicitly describes the motion between video frames.
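A rough sketch of the stacking step, assuming the optical flow fields have already been computed by some external method and are given as arrays of shape (H, W, 2) holding horizontal and vertical displacement. The name `stack_flow` and the values L = 10 and 224x224 are illustrative choices for this example.

```python
import numpy as np

def stack_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) input.

    Channels 2k and 2k+1 hold the horizontal and vertical displacement
    between frame k and frame k+1.
    """
    return np.concatenate(flows, axis=-1)

# Random arrays stand in for real, precomputed optical flow.
L, H, W = 10, 224, 224
flows = [np.random.randn(H, W, 2).astype(np.float32) for _ in range(L)]
volume = stack_flow(flows)
print(volume.shape)  # (224, 224, 20)
```

The resulting volume is what the temporal network would take in place of an RGB image, so its first convolutional layer sees 2L input channels instead of 3.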
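Conclusion: The video is split into frames, which are fed to a first convolutional neural network to extract static (appearance) features. At the same time, optical flow maps extracted from the video are fed to a second convolutional neural network to extract motion features. Finally, the scores output by the softmax layers of the two networks are fused.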
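The final fusion step might look like the sketch below. It assumes fusion by a weighted average of the two softmax outputs, which is only one possible fusion rule. The function names, the `w_temporal` weight, and the example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(spatial_logits, temporal_logits, w_temporal=0.5):
    """Weighted average of the class probabilities from the two streams."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return (1.0 - w_temporal) * p_spatial + w_temporal * p_temporal

# Made-up logits for a 3-class problem.
spatial = np.array([2.0, 1.0, 0.1])   # appearance stream favors class 0
temporal = np.array([0.5, 2.5, 0.3])  # motion stream favors class 1
fused = fuse_scores(spatial, temporal)
print(fused, fused.argmax())
```

Because each stream outputs a probability distribution, any convex combination of the two is again a valid distribution over classes.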
Paper3: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Long-term Recurrent Convolutional Networks (LRCNs) combine convolutional layers with long-range temporal recursion and are end-to-end trainable.
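To make the "convolutional features followed by recurrence" idea concrete, here is a forward-pass-only sketch in NumPy. It is not the paper's implementation: the CNN is replaced by a hypothetical linear-plus-ReLU stand-in (`frame_features`), all weights are random, all sizes are made up, and averaging the per-step scores into one video-level prediction is an assumption of this example. No training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_features(frame, W_feat):
    """Stand-in for the CNN: linear projection of the flattened frame + ReLU."""
    return np.maximum(W_feat @ frame.ravel(), 0.0)

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x; h] to the four stacked gate pre-activations."""
    n = h.size
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:n])           # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:])       # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def lrcn_forward(frames, W_feat, W, b, W_out):
    """Per-frame features -> LSTM over time -> average of per-step class scores."""
    n = b.size // 4
    h, c = np.zeros(n), np.zeros(n)
    scores = []
    for frame in frames:
        x = frame_features(frame, W_feat)
        h, c = lstm_step(x, h, c, W, b)
        scores.append(W_out @ h)
    return np.mean(scores, axis=0)

# Illustrative sizes: 8 grayscale 16x16 frames, 5 classes.
T, H, Wd, feat, hidden, classes = 8, 16, 16, 32, 24, 5
frames = rng.random((T, H, Wd))
W_feat = rng.normal(0, 0.1, (feat, H * Wd))
W = rng.normal(0, 0.1, (4 * hidden, feat + hidden))
b = np.zeros(4 * hidden)
W_out = rng.normal(0, 0.1, (classes, hidden))
video_scores = lrcn_forward(frames, W_feat, W, b, W_out)
print(video_scores.shape)  # (5,)
```

In a trainable version, gradients would flow from the class scores back through the recurrent steps into the feature extractor, which is what makes the combined model end-to-end trainable.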