Video Saliency Detection Literature Notes (1): Unified Image and Video Saliency Modeling

老光头_ME2CS

Categories: Saliency Detection, Pytorch. Tags: computer vision

Article link: https://blog.csdn.net/Forrest97/article/details/107973552

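My research topic involves modeling human attention/gaze/saliency, so I have recently been reading papers on saliency detection. This blog serves both as my study notes and as a way to exchange ideas with others.
The first paper is Unified Image and Video Saliency Modeling from a University of Oxford research team, published at ECCV 2020. Its defining feature is that one model handles saliency detection for both static images and dynamic video. Compared with the traditional approach of predicting video saliency by running a static-image saliency detector frame by frame, this is a clear innovation.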

Paper

Abstract

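The abstract opens by stating the current gap: image and video saliency models are developed independently of each other. It then asks whether the strengths of both can be combined in a single unified model.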

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature.
Can image and video saliency modeling be approached via a unified model, with mutual benefit?

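Training and testing use several public image and video saliency datasets.

Video datasets	Image datasets
DHF1K, Hollywood-2 and UCF-Sports	SALICON and MIT300

The model introduces four new techniques: Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN.
Results: the prediction mode can be switched through a parameter, the model is lightweight in terms of parameter count, and accuracy improves over previous models (seriously impressive).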

Introduction

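The first paragraph gives a brief introduction to saliency prediction/modeling;
the second paragraph introduces dynamic video datasets and models;
the third paragraph describes the split between static and dynamic modeling, especially networks that require optical flow or a fixed number of frames as input, which leads the authors to ask: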

Is it possible to model static and dynamic saliency via one unified framework, with mutual benefit?

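In the fourth paragraph, the authors set out to model the domain shift between image and video saliency data, and propose the UNISAL neural network architecture built around domain adaptation techniques (fancy wording). The network is trained on the DHF1K, Hollywood-2, UCF-Sports and SALICON datasets.
In the fifth paragraph, evaluation on the test sets of these four datasets plus MIT300 shows strong performance: the model outperforms current state-of-the-art methods on all video saliency datasets and achieves competitive performance for image saliency prediction.
Finally, the main contributions are summarized:

The first unified saliency detection framework for images and video
Four domain adaptation techniques that allow features to be shared across the different tasks
A 5 to 20-fold reduction in parameter count compared with existing models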

Related Work

Image Saliency Modeling

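A short history of saliency modeling, from Professor Itti's bottom-up model to top-down deep learning approaches, with brief coverage of several recent papers, leading into dynamic video.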

Video Saliency Modeling

Several classes of traditional methods: low-level visual statistics, with additional temporal features (e.g., optical flow); the center-surround saliency in static scenes. Their limitation: limited by the representation ability of the low-level features for temporal information.
Deep learning methods: a multi-stream convolutional long short-term memory network (ConvLSTM); an attention mechanism with ConvLSTM; 3D convolutions. Their limitation: resulting in limited applicability to static scenes.

Spatio-Temporal Visual Prediction

Earlier methods for extracting spatial and temporal features: spatio-temporal motion patterns, the spatio-temporal domain by using LSTMs, the phase spectrum of the Quaternion Fourier Transform. Their limitation: rendering the models unable to simultaneously model image saliency.

Unified Image and Video Saliency Modeling

Now for the main topic: how the model is built.

Domain-Shift Modeling

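First, a domain shift analysis is carried out on the DHF1K, Hollywood-2, UCF-Sports and SALICON datasets.
Procedure: 256 frames are randomly sampled from the four datasets and fed into a pre-trained MobileNetV2; the resulting feature vectors are reduced in dimension with t-SNE and visualized.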

Domain-Adaptive Batch Normalization

This is a standard normalization step that reduces the differences in mean and variance between datasets, which helps network training.
Batch normalization (BN) aims to reduce the internal covariate shift of neural network activations by transforming their distribution to zero mean and unit variance for each training batch.
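Two normalization strategies are compared:

domain-invariant: mean and variance computed over all samples
domain-adaptive: mean and variance computed separately over the samples from the respective dataset

The t-SNE comparison in the paper shows that the dataset clusters become much less distinct after domain-adaptive normalization. The model therefore uses a separate set of BN modules for each dataset.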
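The difference between the two strategies can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the activation values are made up, and the learnable scale/shift and running statistics of a real BN layer are left out.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability, as in standard BN


def standardize(values, mean, var):
    """Shift and scale activations with the given statistics."""
    return [(v - mean) / (var + EPS) ** 0.5 for v in values]


def domain_invariant_bn(batches):
    """One mean/variance computed over the samples of all datasets."""
    pooled = [v for values in batches.values() for v in values]
    mean, var = fmean(pooled), pvariance(pooled)
    return {name: standardize(values, mean, var) for name, values in batches.items()}


def domain_adaptive_bn(batches):
    """Separate mean/variance for the samples of each dataset."""
    return {
        name: standardize(values, fmean(values), pvariance(values))
        for name, values in batches.items()
    }


# Toy activations (invented): the two datasets have very different means.
batches = {"SALICON": [0.0, 1.0, 2.0, 3.0], "DHF1K": [10.0, 11.0, 12.0, 13.0]}
invariant = domain_invariant_bn(batches)
adaptive = domain_adaptive_bn(batches)
for name in batches:
    print(name, round(fmean(invariant[name]), 3), round(fmean(adaptive[name]), 3))
```

With shared statistics the two datasets stay separated after normalization (means of opposite sign), while per-dataset statistics centre both at zero, which is the effect the t-SNE comparison visualizes.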

Domain-Adaptive Priors

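The strength of the center bias in the fixations differs between datasets. Hollywood-2 and UCF Sports show the strongest center bias, while the image dataset SALICON is spread out much more widely, possibly because each SALICON image is viewed for 5 s. The authors therefore learn a separate set of Gaussian prior maps for each dataset.

Earlier experimental results found that the higher the vehicle speed, the smaller the gaze region.
In hazardous scenes, the gaze region is distributed more widely.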

Domain-Adaptive Fusion

The viewing conditions of the participants differ between datasets:
Hollywood-2 and UCF Sports datasets are task-driven, the viewer is instructed to identify the main action shown. The DHF1K dataset contains free-viewing fixations.

How does this relate to the discussion that follows?
It addresses whether different datasets need different Fusion layers (1x1 convolution) to fuse the feature maps.

Method: build a simple saliency model that extracts features with MNet V2 and then fuses the feature maps with a Fusion layer (1x1 convolution), and compare the performance of two variants.

domain-invariant: same weights for all datasets
domain-adaptive: different weights for each dataset

According to the paper's results, the domain-adaptive variant clearly performs better.
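A 1x1 convolution that reduces the features to a single channel is just a per-pixel weighted sum over channels, so the domain-adaptive variant amounts to keeping one weight vector per dataset. The sketch below illustrates this in plain Python; the weight values and the two-channel feature maps are hypothetical and chosen only for illustration.

```python
def fuse_1x1(features, weights, bias=0.0):
    """1x1 convolution to one channel: per-pixel weighted sum over channels.

    features: list of C channel maps, each a list of rows (H x W).
    weights:  list of C scalars, one per input channel.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            bias + sum(w * channel[y][x] for w, channel in zip(weights, features))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical fusion weights; in the domain-adaptive variant each dataset has its own.
FUSION_WEIGHTS = {
    "DHF1K": [0.5, 0.5],
    "SALICON": [1.0, -1.0],
}


def domain_adaptive_fusion(features, dataset):
    """Select the fusion weights that belong to the sample's dataset."""
    return fuse_1x1(features, FUSION_WEIGHTS[dataset])


# Two 2x2 feature channels (invented values).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 2.0], [1.0, 0.0]],
]
print(domain_adaptive_fusion(features, "DHF1K"))    # [[0.5, 2.0], [2.0, 2.0]]
print(domain_adaptive_fusion(features, "SALICON"))  # [[1.0, 0.0], [2.0, 4.0]]
```

The domain-invariant variant would correspond to using the same entry of `FUSION_WEIGHTS` for every dataset.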

Domain-Adaptive Smoothing

The ground-truth saliency maps of different datasets are produced with different blurring filters. After resizing the maps to the network input resolution, the distributions of sharpness (maximum gradient) differ considerably between datasets, with DHF1K showing the highest sharpness.
The authors therefore blur the network output with a different learned Smoothing kernel for each dataset.

The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
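To make the sharpness argument concrete, here is a small sketch that measures sharpness as the largest difference between neighbouring pixels and shows how a blur kernel lowers it. The finite-difference definition, the 3x3 kernels (the paper uses learned 41x41 kernels) and the toy map are all assumptions made for illustration.

```python
def max_gradient(sal_map):
    """Sharpness proxy: largest absolute difference between neighbouring pixels."""
    h, w = len(sal_map), len(sal_map[0])
    best = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                best = max(best, abs(sal_map[y][x + 1] - sal_map[y][x]))
            if y + 1 < h:
                best = max(best, abs(sal_map[y + 1][x] - sal_map[y][x]))
    return best


def smooth(sal_map, kernel):
    """Cross-correlate with a square kernel, zero padding, same output size."""
    h, w, k = len(sal_map), len(sal_map[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * sal_map[yy][xx]
            out[y][x] = acc
    return out


# Hypothetical per-dataset kernels: no blur for one dataset, a box blur for another.
IDENTITY = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
BOX = [[1 / 9] * 3 for _ in range(3)]
SMOOTHING_KERNELS = {"dataset_a": IDENTITY, "dataset_b": BOX}

# Toy saliency map with a single sharp peak in the centre.
sharp = [[0.0] * 5 for _ in range(5)]
sharp[2][2] = 1.0

for name, kernel in SMOOTHING_KERNELS.items():
    print(name, max_gradient(smooth(sharp, kernel)))
```

The same network output ends up with a different sharpness depending on which dataset's kernel is applied, which is what lets the model match each dataset's ground-truth blur.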

UNISAL Network Architecture

Overall, the network follows an encoder-RNN-decoder structure.

Encoder Network

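MobileNet-V2 is chosen as the backbone for three reasons:

It has relatively few parameters, which allows a larger batch size and more input video frames
It improves inference efficiency, enabling real-time monitoring
It helps avoid overfitting

Because the encoder uses ImageNet-pretrained parameters with a fixed normalization, Domain-Adaptive Batch Normalization is not applied in the encoder.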

Gaussian Prior Maps

The domain-adaptive Gaussian prior maps are computed as follows:

$g^{(i)}(x, y)=\gamma \exp \left(-\frac{\left(x-\mu_{x}^{(i)}\right)^{2}}{\left(\sigma_{x}^{(i)}\right)^{2}}-\frac{\left(y-\mu_{y}^{(i)}\right)^{2}}{\left(\sigma_{y}^{(i)}\right)^{2}}\right)$
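Parameter: $\gamma = 6$, determined by the ReLU6 activation in MNet V2. The paper proposes unconstrained Gaussian prior maps rather than drawing the initial Gaussian parameters from a normal distribution, which results in highly correlated maps.
Panel (b) of the paper's figure shows x and y following a normal distribution. The Gaussian prior maps are placed between the CNN and the RNN, to model the static center bias and in order to leverage the prior maps in higher-level features.

I did not fully understand panel (c).
The paper says: instead of drawing the initial Gaussian parameters from a normal distribution, which results in highly correlated maps, we initialize $N_G = 16$ maps. That is, to avoid being biased by a single initial distribution, the 16 different initial maps shown in panel (c) are used.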
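The prior-map formula can be evaluated directly. The sketch below is a plain-Python illustration under two assumptions that are not stated in these notes: coordinates are normalized to [0, 1], and the example parameter values are made up (in the model they are learned, one set per dataset).

```python
import math

GAMMA = 6.0  # scaling factor; matches the upper bound of ReLU6


def gaussian_prior_map(height, width, mu_x, mu_y, sigma_x, sigma_y, gamma=GAMMA):
    """Evaluate g(x, y) = gamma * exp(-(x-mu_x)^2/sigma_x^2 - (y-mu_y)^2/sigma_y^2).

    Pixel coordinates are mapped to [0, 1] (an assumption of this sketch).
    """
    rows = []
    for j in range(height):
        y = j / (height - 1)
        row = []
        for i in range(width):
            x = i / (width - 1)
            exponent = -((x - mu_x) ** 2) / sigma_x ** 2 - ((y - mu_y) ** 2) / sigma_y ** 2
            row.append(gamma * math.exp(exponent))
        rows.append(row)
    return rows


# Hypothetical parameters for one centred prior map on a 5x5 grid.
prior = gaussian_prior_map(5, 5, mu_x=0.5, mu_y=0.5, sigma_x=0.25, sigma_y=0.25)
for row in prior:
    print([round(v, 3) for v in row])
```

The map peaks at $\gamma$ at the centre $(\mu_x, \mu_y)$ and decays towards the borders; narrower $\sigma$ values would model a stronger center bias.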

Bypass-RNN

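Previous ways of extracting temporal features (RNNs, optical flow or 3D convolutions) are not applicable to static images. The paper proposes the Bypass-RNN module, which uses a residual connection so that the temporal path is applied selectively.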

RNN whose output is added to its input features via a residual connection that is automatically omitted (bypassed) for static batches during training and inference

The Bypass-RNN module is built from a convolutional GRU (cGRU).
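The bypass idea itself is simple and can be sketched without a deep learning framework. In the sketch below a running exponential average stands in for the cGRU (an assumption made purely to keep the example self-contained), and each time step is a single number instead of a feature map.

```python
def toy_recurrence(sequence, decay=0.5):
    """Stand-in for the cGRU: running exponential average over time steps."""
    state = 0.0
    outputs = []
    for x in sequence:
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs


def bypass_rnn(sequence, static):
    """Residual recurrent block that is skipped entirely for static inputs."""
    if static:
        # Image batch: the recurrent path is bypassed, features pass through unchanged.
        return list(sequence)
    temporal = toy_recurrence(sequence)
    # Video batch: recurrent output is added to the input via the residual connection.
    return [x + t for x, t in zip(sequence, temporal)]


image_batch = [2.0]
video_clip = [2.0, 4.0, 6.0]
print(bypass_rnn(image_batch, static=True))   # [2.0]
print(bypass_rnn(video_clip, static=False))   # [3.0, 6.5, 10.25]
```

Because the recurrent output is only ever added on top of the input features, dropping it for images leaves the rest of the network with inputs of the same form, so the same decoder can serve both modes.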

Decoder Network and Smoothing

The decoder details are as follows. To keep the parameter count low, it relies on depthwise separable convolutions and pointwise 1x1 convolutions.
The Post-US2 features are reduced to a single channel by a Domain-Adaptive Fusion layer (1x1 convolution) and upsampled to the input resolution via nearest-neighbor interpolation.
The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
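Why depthwise separable convolutions save parameters can be seen from a quick count. The channel numbers below are hypothetical (they are not the decoder's actual layer sizes), and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1x1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 256 -> 256 channels with 3x3 kernels.
standard = standard_conv_params(256, 256, 3)
separable = separable_conv_params(256, 256, 3)
print(standard, separable, round(standard / separable, 2))  # 589824 67840 8.69
```

For this example layer the separable version needs roughly 8.7 times fewer weights, which is the kind of saving that keeps the overall model small.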

Domain-Aware Optimization

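Image aspect ratios differ between datasets. With Domain-Adaptive Batch Normalization in the network, each dataset can be fed in at its own input size.

Isn't the input size of MobileNet V2 fixed?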

The images/frames of the training datasets each have different aspect ratios, specifically 4:3 for SALICON, 16:9 for DHF1K, and 1.85:1 (median) for Hollywood-2, and 3:2 (median) for UCF Sports.
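we use input resolutions of 288×384, 224×384, 224×416 and 256×384 for SALICON, DHF1K, Hollywood-2 and UCF Sports, respectively.
Frame rates differ between the video datasets, so one frame is taken out of every 5 or every 4 frames respectively, giving the network a 6 Hz input.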
The frame rate of the DHF1K videos is 30 fps compared to 24 fps for Hollywood-2 and UCF Sports.
In order to assimilate the frame rates during training, and to train on longer time intervals, we construct clips using every 5th frame for DHF1K and every 4th frame for all others, yielding 6 fps overall.
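The frame subsampling can be written down in a few lines. The sketch below uses the frame rates and steps quoted above and the 12-frame clip length given in the experimental setup; the function names are my own.

```python
NATIVE_FPS = {"DHF1K": 30, "Hollywood-2": 24, "UCF Sports": 24}
FRAME_STEP = {"DHF1K": 5, "Hollywood-2": 4, "UCF Sports": 4}
CLIP_LENGTH = 12  # frames per training clip


def effective_fps(dataset):
    """Frame rate seen by the network after taking every n-th frame."""
    return NATIVE_FPS[dataset] / FRAME_STEP[dataset]


def clip_indices(dataset, start=0, clip_length=CLIP_LENGTH):
    """Indices of the original video frames that make up one clip."""
    step = FRAME_STEP[dataset]
    return [start + i * step for i in range(clip_length)]


def clip_duration_seconds(dataset, clip_length=CLIP_LENGTH):
    """Time span of the original video covered by one clip."""
    return clip_length * FRAME_STEP[dataset] / NATIVE_FPS[dataset]


for name in NATIVE_FPS:
    print(name, effective_fps(name), clip_duration_seconds(name), clip_indices(name)[:4])
```

All three video datasets end up at 6 fps, and a 12-frame clip covers 2 s of video regardless of the native frame rate.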

Experiments

Experimental Setup

Dataset splits:
For SALICON, we use the official training/validation/testing split of 10,000/5,000/5,000.
For Hollywood-2 and UCF Sports, we use the training and testing splits of 823/884 and 103/47 videos, and the corresponding validation sets are randomly sampled 10% from the training sets, following.
For Hollywood-2, the videos are divided into individual shots.
For the DHF1K dataset, we use the official training/validation/testing splits of 600/100/300 videos.
Stochastic Gradient Descent with momentum of 0.9 and weight decay of $10^{-4}$. Gradients are clipped to 2.
The learning rate is set to 0.04 and exponentially decayed by a factor of 0.8 after each epoch.
The batch size is set to 4 for video data and 32 for SALICON.
The video clip length is set to 12 frames that are sampled as described in Section 3.3.
To prevent overfitting, the weights of MNet V2 are frozen for the first two epochs and afterwards trained with a learning rate that is reduced by a factor of 10. The pretrained BN statistics of MNet V2 are frozen throughout training.
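The learning-rate settings above translate into a simple schedule. The sketch below assumes epochs are counted from 0 and that "frozen" means an effective learning rate of 0 for the encoder; both are my reading of the text, not details confirmed by the paper.

```python
BASE_LR = 0.04           # initial learning rate
DECAY = 0.8              # multiplicative decay applied after each epoch
FROZEN_EPOCHS = 2        # MNet V2 weights are frozen for the first two epochs
ENCODER_LR_FACTOR = 0.1  # encoder learning rate is reduced by a factor of 10


def learning_rates(epoch):
    """Return (main learning rate, encoder learning rate) for a 0-based epoch."""
    lr = BASE_LR * DECAY ** epoch
    encoder_lr = 0.0 if epoch < FROZEN_EPOCHS else lr * ENCODER_LR_FACTOR
    return lr, encoder_lr


for epoch in range(5):
    lr, encoder_lr = learning_rates(epoch)
    print(epoch, round(lr, 5), round(encoder_lr, 6))
```

In a framework such as PyTorch this would typically be expressed with two parameter groups and an exponential scheduler; the plain function here only shows the resulting numbers.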

The target size is the original image size.

Dataset	Type	Image size	Frame rate (fps)	Network input	Network output	Target size
DHF1K	Video	640x360	30	384x224	384x224	640x360
Hollywood	Video	640x304	24	416x224	416x224	640x304
UCFSports	Video	720x404	24	384x256	384x256	720x404
SALICON	Image	640x480	/	384x288	384x288	640x480

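Note: the ImageNet-pretrained weights of Google's convolutional networks (including MobileNetV2) are typically trained at 224x224 input, although the convolutional layers themselves accept other sizes. For the pretrained weights to transfer well, the input should keep the original aspect ratio and stay close to that scale.
The paper does not explain in detail how the resolutions were chosen. My guess is that the smaller side of the network input/output is set to at least 224, which for a 16:9 image gives a width of about 384, and that the sizes for the other datasets are defined from these two numbers.
For video, the network takes 12 frames at a time, i.e. the past 2 s of footage (for 30 fps video, taking 1 of every 5 frames gives 12 frames in 2 s; for 24 fps video, taking 1 of every 4 frames also gives 12 frames in 2 s).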
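The resolution guess in the note above can be checked with a little arithmetic. The sketch takes the heights and aspect ratios quoted from the paper as given and tests one possible rule: the width is the aspect-ratio width rounded to the nearest multiple of 32 (MobileNetV2 downsamples by a total factor of 32). This rule is my own assumption, not something the paper states.

```python
STRIDE = 32  # total downsampling factor of the MobileNetV2 feature extractor

# dataset: (aspect ratio width/height, (input height, input width)) as quoted from the paper
DATASETS = {
    "SALICON": (4 / 3, (288, 384)),
    "DHF1K": (16 / 9, (224, 384)),
    "Hollywood-2": (1.85, (224, 416)),
    "UCF Sports": (3 / 2, (256, 384)),
}


def snapped_width(height, aspect_ratio, stride=STRIDE):
    """Width implied by the aspect ratio, rounded to the nearest multiple of stride."""
    return round(height * aspect_ratio / stride) * stride


for name, (ratio, (height, width)) in DATASETS.items():
    exact = height * ratio
    print(name, height, width, round(exact, 1), snapped_width(height, ratio))
```

For all four datasets the rounded width matches the width used in the paper (for DHF1K, 224 x 16/9 is about 398, which snaps to 384), so the guess is at least consistent with the reported numbers, although it does not explain how the heights were chosen.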