【video analysis in deep learning】

儒雅的晴天

Category: Machine Learning

Article link: https://blog.csdn.net/weixin_39915444/article/details/82822640

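Image dataset: ImageNet

Video dataset: UCF-101

Early video representation learning relied mainly on hand-crafted feature extraction.

These methods have four clear drawbacks:

They are sensitive to camera motion and illumination changes
They carry no high-level semantic information
The feature dimensionality is very high
They are slow to compute

Early deep-learning-based video representations were built on 2D convolutional neural networks (2D-CNNs).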

Paper1: Large-scale Video Classification with Convolutional Neural Networks

Dataset: The Sports-1M dataset consists of 1 million YouTube videos annotated with 487 classes.

An effective approach to speeding up the runtime performance of CNNs is to modify the architecture to contain two separate streams of processing: a context stream that learns features on low-resolution frames and a high-resolution fovea stream that only operates on the middle portion of the frame.

Figure caption: Multiresolution CNN architecture. Input frames are fed into two separate streams of processing: a context stream that models the low-resolution image and a fovea stream that processes a high-resolution center crop. Both streams consist of alternating convolution (red), normalization (green) and pooling (blue) layers. Both streams converge to two fully connected layers (yellow).
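
The two inputs of the multiresolution architecture can be sketched roughly as follows. This is a minimal illustration in plain Python, assuming a single-channel frame stored as a nested list, 2x downsampling by averaging 2x2 blocks for the context stream, and a center crop of half the frame size for the fovea stream. The function names and the toy sizes are illustrative assumptions, not the paper's code.

```python
def downsample_2x(frame):
    """Context stream input: halve the resolution by averaging 2x2 blocks."""
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[r][c] + frame[r][c + 1]
             + frame[r + 1][c] + frame[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def center_crop(frame, crop_h, crop_w):
    """Fovea stream input: keep the central region at full resolution."""
    h, w = len(frame), len(frame[0])
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]


def multiresolution_inputs(frame):
    """Return (context, fovea); both are half the original height and width."""
    h, w = len(frame), len(frame[0])
    context = downsample_2x(frame)
    fovea = center_crop(frame, h // 2, w // 2)
    return context, fovea


# Toy 4x4 single-channel frame.
frame = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
context, fovea = multiresolution_inputs(frame)
print(context)  # [[2.5, 4.5], [10.5, 12.5]]
print(fovea)    # [[5, 6], [9, 10]]
```

In this sketch each stream receives one quarter of the original pixel count, which is where the runtime saving comes from.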

The context stream learns more color features, while the high-resolution fovea stream learns high-frequency grayscale filters.

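Conclusion: 1. Two separate streams of processing: a context stream that models the low-resolution image and a fovea stream that processes a high-resolution center crop.

2. Slow fusion: temporal information is fused gradually across successive layers, so higher layers see progressively more of the clip.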
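
As a rough illustration of slow fusion, the sketch below tracks which input frames each unit can see as temporal windows are merged layer by layer. The layer settings (a 10-frame clip, a first layer with temporal extent 4 and stride 2, then two layers with extent 2 and stride 2) follow my reading of the paper's description and should be treated as assumptions. The helper name is hypothetical.

```python
def temporal_fuse(windows, extent, stride):
    """Merge neighbouring temporal windows as a valid 1D convolution would.

    Each window is a (first_frame, last_frame) pair of input-frame indices.
    """
    fused = []
    for start in range(0, len(windows) - extent + 1, stride):
        group = windows[start:start + extent]
        fused.append((group[0][0], group[-1][1]))
    return fused


# Assumed settings: 10-frame clip, then (extent, stride) for each layer.
num_frames = 10
layers = [(4, 2), (2, 2), (2, 2)]

first_layer = temporal_fuse([(t, t) for t in range(num_frames)], 4, 2)

windows = [(t, t) for t in range(num_frames)]  # each frame sees only itself
for depth, (extent, stride) in enumerate(layers, start=1):
    windows = temporal_fuse(windows, extent, stride)
    print(f"layer {depth}: {windows}")
# layer 1: [(0, 3), (2, 5), (4, 7), (6, 9)]
# layer 2: [(0, 5), (4, 9)]
# layer 3: [(0, 9)]
```

Under these settings, only the third layer has access to all 10 input frames. Early fusion would merge everything in the first layer, and late fusion would merge only at the end.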

Paper2: Two-Stream Convolutional Networks for Action Recognition in Videos

The temporal part, in the form of motion across the frames, conveys the movement of the observer (the camera) and the objects.

The input to the temporal network is formed by stacking optical flow displacement fields between several consecutive frames. Such input explicitly describes the motion between video frames.
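
A minimal sketch of this stacking, assuming each flow field is given as a pair of 2D lists (horizontal and vertical displacement) and that channels are interleaved as x, y, x, y, ..., so that L flow fields produce 2L input channels. The function name and the toy values are illustrative, and computing the optical flow itself is out of scope here.

```python
def stack_optical_flow(flows):
    """Stack L flow fields into a 2L-channel input.

    `flows` is a list of (dx, dy) pairs, one per pair of consecutive frames.
    dx and dy are 2D lists of equal shape. Channels are ordered
    dx_1, dy_1, dx_2, dy_2, ...
    """
    channels = []
    for dx, dy in flows:
        channels.append(dx)
        channels.append(dy)
    return channels


# Toy example: L = 3 flow fields on a 2x2 frame (values are made up).
flows = [
    ([[0.1, 0.0], [0.0, 0.2]], [[0.0, 0.0], [0.3, 0.0]]),
    ([[0.2, 0.1], [0.0, 0.1]], [[0.1, 0.0], [0.2, 0.0]]),
    ([[0.3, 0.1], [0.1, 0.0]], [[0.1, 0.1], [0.1, 0.0]]),
]
stacked = stack_optical_flow(flows)
print(len(stacked))                         # 6 channels = 2 * L
print(len(stacked[0]), len(stacked[0][0]))  # 2 2 (height, width)
```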

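Conclusion: Individual video frames are fed into one convolutional network, which is trained to extract static (appearance) features, while optical flow maps computed from the video are fed into a second convolutional network to extract motion features. Finally, the softmax scores output by the two networks are fused.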
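
The final fusion step can be sketched as a simple average of the two streams' softmax scores. Averaging is only one possible fusion rule. The logits, the number of classes, and the function names below are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(spatial_logits, temporal_logits):
    """Late fusion: average the per-class softmax scores of both streams."""
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    return [(s + t) / 2.0 for s, t in zip(spatial, temporal)]


spatial_logits = [2.0, 1.0, 0.1]   # spatial stream (RGB frames), toy values
temporal_logits = [0.5, 2.5, 0.2]  # temporal stream (optical flow), toy values
fused = fuse_scores(spatial_logits, temporal_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

In this toy case the spatial stream prefers class 0 and the temporal stream prefers class 1. The temporal stream is more confident, so the fused prediction is class 1.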

Paper3: Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Long-term Recurrent Convolutional Networks (LRCNs) combine convolutional layers with long-range temporal recursion and are end-to-end trainable.
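
The forward pass of this idea (per-frame visual features fed into a recurrent unit) can be sketched in plain Python. This is a toy under several assumptions: the CNN is replaced by a hypothetical two-number feature function, the LSTM uses random weights with biases omitted, and no training is shown. All names are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_step(x, h, c, params):
    """One LSTM step; gates are computed from the concatenation [x, h]."""
    z = x + h  # list concatenation
    i = [sigmoid(v) for v in matvec(params["W_i"], z)]    # input gate
    f = [sigmoid(v) for v in matvec(params["W_f"], z)]    # forget gate
    o = [sigmoid(v) for v in matvec(params["W_o"], z)]    # output gate
    g = [math.tanh(v) for v in matvec(params["W_g"], z)]  # candidate cell
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def frame_features(frame):
    """Hypothetical stand-in for the CNN: mean and max pixel value."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def lrcn_forward(frames, params, hidden_size):
    """Feed per-frame features through the LSTM; return every hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    states = []
    for frame in frames:
        h, c = lstm_step(frame_features(frame), h, c, params)
        states.append(h)
    return states


random.seed(0)
feature_size, hidden_size = 2, 3
params = {
    name: [[random.uniform(-0.5, 0.5)
            for _ in range(feature_size + hidden_size)]
           for _ in range(hidden_size)]
    for name in ("W_i", "W_f", "W_o", "W_g")
}
frames = [[[0.1 * t, 0.2], [0.3, 0.1 * t]] for t in range(4)]  # 4 toy frames
states = lrcn_forward(frames, params, hidden_size)
print(len(states), len(states[0]))  # 4 3
```

In a real model the hidden state at each time step would feed a classifier or a word predictor, and gradients would flow back through both the recurrent unit and the convolutional layers.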
