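Because my research involves modeling human attention, gaze and saliency, I have recently been reading papers on saliency detection. This blog serves both as my study notes and as a way to exchange ideas with others.
The first paper is Unified Image and Video Saliency Modeling (ECCV 2020) from a research team at the University of Oxford. Its main feature is that it handles saliency detection for both static images and dynamic videos in one model. Compared with the traditional approach of predicting video saliency by running a static-image saliency model frame by frame, this is a clear innovation.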
Paper
Abstract
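The abstract opens by stating the gap in current modeling: image and video saliency models are developed separately. It then asks whether the strengths of both can be combined in one unified model.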
Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature.
Can image and video saliency modeling be approached via a unified model, with mutual benefit?
The model is trained and tested on several public image and video saliency datasets:
Image datasets | Video datasets |
---|---|
SALICON and MIT300 | DHF1K, Hollywood-2 and UCF-Sports |
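The model introduces four new techniques: Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN.
Results: the prediction mode can be switched through a parameter, the model is lightweight in parameter count, and accuracy improves over previous models (this deserves a big thumbs up).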
Introduction
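The first paragraph gives a general introduction to saliency prediction/modeling.
The second paragraph introduces dynamic video datasets and models.
The third paragraph describes how static and dynamic modeling are kept separate, in particular by networks that require optical flow or a fixed number of frames as input. The authors therefore ask: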
Is it possible to model static and dynamic saliency via one unified framework, with mutual benefit?
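In the fourth paragraph, the authors set out to handle the domain shift between image and video saliency data. They propose the UNISAL neural network architecture, which is built around domain adaptation techniques (quite sophisticated wording). The network is trained on the DHF1K, Hollywood-2, UCF-Sports and SALICON datasets.
In the fifth paragraph, evaluation on the test sets of these four datasets plus MIT300 shows that the model performs very well: it outperforms current state-of-the-art methods on all video saliency datasets and achieves competitive performance for image saliency prediction.
The sixth paragraph summarizes the main contributions:
- The first unified saliency detection framework
- Four domain adaptation techniques that enable feature sharing across tasks
- A 5-20x reduction in parameter count compared with existing models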
Related Work
Image Saliency Modeling
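A history of saliency modeling, from Prof. Itti's bottom-up model to top-down deep learning approaches, with a brief overview of several recent papers. This leads into dynamic video.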
Video Saliency Modeling
Traditional methods: low-level visual statistics, with additional temporal features (e.g., optical flow); the center-surround saliency in static scenes. Their limitation: limited by the representation ability of the low-level features for temporal information.
Deep learning methods: via a multi-stream convolutional long short-term memory network (ConvLSTM); attention mechanism with ConvLSTM; 3D convolutions. Their limitation: resulting in limited applicability to static scenes.
Spatio-Temporal Visual Prediction
Earlier methods for extracting spatial and temporal features: spatio-temporal motion patterns; the spatio-temporal domain by using LSTMs; the phase spectrum of the Quaternion Fourier Transform. Their limitation: rendering the models unable to simultaneously model image saliency.
Unified Image and Video Saliency Modeling
Now to the main topic: how the model is built.
Domain-Shift Modeling
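The authors first run domain shift analyses on the DHF1K, Hollywood-2, UCF-Sports and SALICON datasets.
Procedure: randomly sample 256 frames from the four datasets, feed them into a pre-trained MobileNetV2 network to obtain feature vectors, then reduce the dimensionality with t-SNE for visualization.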
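The procedure above can be sketched roughly as follows. This is only an illustration, not the authors' code: the feature vectors are synthetic stand-ins for MobileNetV2 embeddings (random Gaussians with a made-up per-dataset offset), and scikit-learn's `TSNE` is assumed to be available for the 2D projection.

```python
import numpy as np
from sklearn.manifold import TSNE

DATASETS = ["DHF1K", "Hollywood-2", "UCF Sports", "SALICON"]

def fake_features(n_per_dataset=64, dim=32, seed=0):
    """Synthetic stand-in for pooled backbone features (hypothetical data)."""
    rng = np.random.default_rng(seed)
    feats, labels = [], []
    for idx, _name in enumerate(DATASETS):
        offset = rng.normal(scale=3.0, size=dim)  # made-up per-dataset shift
        feats.append(rng.normal(size=(n_per_dataset, dim)) + offset)
        labels.append(np.full(n_per_dataset, idx))
    return np.concatenate(feats), np.concatenate(labels)

features, labels = fake_features()

# Project the feature vectors to 2D; points can then be colored by `labels`.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
```

With real data, `fake_features` would be replaced by features extracted from an ImageNet-pretrained MobileNetV2, and clusters that separate by dataset in the scatter plot would indicate a domain shift.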
Domain-Adaptive Batch Normalization
Normalization is a standard step: it reduces the differences in mean and variance between datasets, which helps neural network training.
Batch normalization (BN) aims to reduce the internal covariate shift of neural network activations by transforming their distribution to zero mean and unit variance for each training batch.
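Two normalization strategies are compared:
- domain-invariant: the mean and variance are computed over all samples
- domain-adaptive: the samples from each dataset are normalized separately
The comparison figure shows that, after domain-adaptive normalization, the per-dataset clustering is clearly reduced. The model therefore uses separate BN modules for each dataset.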
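The difference between the two strategies can be illustrated with a small NumPy sketch. It is a simplification under my own assumptions: the learnable scale/shift and the running statistics of a real BN layer are left out, and the two "datasets" are synthetic.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature column to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def domain_adaptive_bn(x, domain_ids):
    """Normalize the samples of each dataset with their own statistics."""
    out = np.empty_like(x, dtype=float)
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        out[mask] = batch_norm(x[mask])
    return out

rng = np.random.default_rng(0)
# Two hypothetical datasets whose features differ in mean and scale.
x = np.concatenate([rng.normal(0.0, 1.0, (100, 4)), rng.normal(5.0, 2.0, (100, 4))])
domain_ids = np.repeat([0, 1], 100)

invariant = batch_norm(x)                     # domain-invariant: one set of statistics
adaptive = domain_adaptive_bn(x, domain_ids)  # domain-adaptive: per-dataset statistics

def mean_gap(z):
    """Largest difference between the per-dataset feature means."""
    return np.abs(z[domain_ids == 0].mean(axis=0) - z[domain_ids == 1].mean(axis=0)).max()

gap_invariant = mean_gap(invariant)
gap_adaptive = mean_gap(adaptive)
```

After domain-invariant normalization the two datasets still sit at different means, while the domain-adaptive version aligns them, which matches the reduced clustering described above.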
Domain-Adaptive Priors
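The strength of the center bias of fixations differs between datasets. It is strongest in Hollywood-2 and UCF Sports, whereas fixations in the image dataset SALICON are much more dispersed, possibly because each image is viewed for 5 s. The authors therefore propose to learn a separate set of Gaussian prior maps for each dataset.
- Earlier experimental results found that the higher the vehicle speed, the smaller the fixation region
- In hazardous scenes the fixation distribution is wider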
Domain-Adaptive Fusion
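The viewing conditions under which the data were collected differ between datasets:
Hollywood-2 and UCF Sports datasets are task-driven, the viewer is instructed to identify the main action shown. The DHF1K dataset contains free-viewing fixations.
- How does this relate to the discussion below?
The question to answer is whether different datasets need different Fusion layers (1x1 convolution) to fuse the feature maps.
Method: build a simple saliency model that extracts features with MNet V2 and then fuses them with a Fusion layer (1x1 convolution), and compare the performance of two variants:
- domain-invariant: same weights for all datasets
- domain-adaptive: different weights for each dataset
According to the results, the domain-adaptive variant clearly performs better.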
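To make the domain-adaptive variant concrete, here is a minimal NumPy sketch. The class name, channel count and random weights are my own placeholders, not taken from the paper. The key point is that a 1x1 convolution to one channel is just a weighted sum over channels at every pixel, and each dataset gets its own weight vector.

```python
import numpy as np

DATASETS = ["DHF1K", "Hollywood-2", "UCF Sports", "SALICON"]

class DomainAdaptiveFusion:
    """1x1 convolution to a single channel, with separate weights per dataset."""

    def __init__(self, in_channels, datasets, seed=0):
        rng = np.random.default_rng(seed)
        # Random placeholders; in a real model these weights would be learned.
        self.weights = {name: rng.normal(size=in_channels) for name in datasets}
        self.biases = {name: 0.0 for name in datasets}

    def __call__(self, features, dataset):
        # features has shape (C, H, W); contract the channel axis with the weights.
        w = self.weights[dataset]
        return np.tensordot(w, features, axes=([0], [0])) + self.biases[dataset]

fusion = DomainAdaptiveFusion(in_channels=8, datasets=DATASETS)
features = np.random.default_rng(1).normal(size=(8, 6, 10))

map_dhf1k = fusion(features, "DHF1K")      # uses the DHF1K weights
map_salicon = fusion(features, "SALICON")  # same features, different weights
```

The domain-invariant variant would simply use one shared weight vector regardless of the `dataset` argument.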
Domain-Adaptive Smoothing
The blurring filter used to generate the ground-truth maps differs between datasets. After resizing each dataset to the network input size, the authors compute the distribution of sharpness (maximum gradient). The datasets differ considerably, and DHF1K has the highest sharpness.
They therefore propose a dataset-specific smoothing kernel: blur the network output with a different learned Smoothing kernel for each dataset.
The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
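A rough NumPy sketch of the idea follows. It rests on assumptions of mine: fixed Gaussian kernels stand in for the learned 41x41 kernels, the sigma values are invented, and sharpness is measured as the maximum gradient magnitude.

```python
import numpy as np

def gaussian_kernel(size=41, sigma=5.0):
    """Normalized 2D Gaussian kernel, a fixed stand-in for a learned kernel."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def smooth(saliency, kernel):
    """Same-size 2D filtering with zero padding (cross-correlation, as in CNN layers)."""
    kh, kw = kernel.shape
    padded = np.pad(saliency, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = saliency.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sharpness(saliency):
    """Maximum gradient magnitude of a saliency map."""
    gy, gx = np.gradient(saliency)
    return np.sqrt(gx ** 2 + gy ** 2).max()

# Hypothetical per-dataset kernels; the sigma values are invented for illustration.
kernels = {"DHF1K": gaussian_kernel(sigma=2.0), "SALICON": gaussian_kernel(sigma=6.0)}

raw = np.zeros((64, 96))
raw[32, 48] = 1.0  # a single fixation-like peak
smoothed = {name: smooth(raw, k) for name, k in kernels.items()}
```

A narrower kernel leaves the output sharper, so a dataset with sharp ground truth such as DHF1K would plausibly end up with less blur than the others.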
UNISAL Network Architecture
Overall, the model is an encoder-RNN-decoder architecture.
Encoder Network
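MobileNet-V2 is chosen as the backbone for three reasons:
- Its relatively small number of parameters allows a larger batch size and more input video frames
- It improves inference efficiency, enabling real-time monitoring
- It helps avoid overfitting
Because the backbone uses ImageNet-pretrained parameters, whose input normalization is fixed, Domain-Adaptive Batch Normalization is not used here.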
Gaussian Prior Maps
The domain-adaptive Gaussian prior maps are computed as follows:
$$g^{(i)}(x, y)=\gamma \exp \left(-\frac{\left(x-\mu_{x}^{(i)}\right)^{2}}{\left(\sigma_{x}^{(i)}\right)^{2}}-\frac{\left(y-\mu_{y}^{(i)}\right)^{2}}{\left(\sigma_{y}^{(i)}\right)^{2}}\right)$$
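Parameter: γ = 6, which follows from the ReLU6 activation in MNet V2. The authors propose unconstrained Gaussian prior maps, and they do not draw the initial Gaussian parameters from a normal distribution, because that results in highly correlated maps.
Panel (b) of the figure shows that x and y are normally distributed. The Gaussian prior maps are placed between the CNN and the RNN to model the static center bias, and in order to leverage the prior maps in higher-level features.
- I did not understand panel (c).
From the paper: instead of drawing the initial Gaussian parameters from a normal distribution, which results in highly correlated maps, we initialize N_G = 16 maps. In other words, to avoid being biased by a single initial distribution, the authors start from the 16 different initial configurations shown in panel (c).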
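The prior-map formula can be evaluated with a few lines of NumPy. This is only a sketch under my own assumptions: coordinates are normalized to [0, 1], and the initial means and widths are invented placeholders, since I have not reproduced the paper's actual initialization. Only γ = 6 and N_G = 16 come from the text.

```python
import numpy as np

N_G = 16     # number of prior maps per dataset, as stated in the paper
GAMMA = 6.0  # matches the upper bound of ReLU6
DATASETS = ["DHF1K", "Hollywood-2", "UCF Sports", "SALICON"]

def gaussian_prior_map(height, width, mu_x, mu_y, sigma_x, sigma_y, gamma=GAMMA):
    """Evaluate gamma * exp(-(x-mu_x)^2/sigma_x^2 - (y-mu_y)^2/sigma_y^2) on a grid."""
    ys = np.linspace(0.0, 1.0, height)  # assumption: normalized coordinates
    xs = np.linspace(0.0, 1.0, width)
    y, x = np.meshgrid(ys, xs, indexing="ij")
    return gamma * np.exp(-((x - mu_x) ** 2) / sigma_x ** 2
                          - ((y - mu_y) ** 2) / sigma_y ** 2)

def init_prior_params(n_maps=N_G):
    """Hypothetical initialization: centered maps with a spread of widths."""
    mu = np.full((n_maps, 2), 0.5)
    sigma = np.stack([np.linspace(0.1, 0.8, n_maps),
                      np.linspace(0.8, 0.1, n_maps)], axis=1)
    return mu, sigma

# One independent parameter set per dataset (these would be learned in training).
prior_params = {name: init_prior_params() for name in DATASETS}

def prior_maps(dataset, height=9, width=13):
    mu, sigma = prior_params[dataset]
    return np.stack([gaussian_prior_map(height, width, m[0], m[1], s[0], s[1])
                     for m, s in zip(mu, sigma)])

maps = prior_maps("SALICON")  # shape (N_G, height, width)
```

These maps would then be concatenated to the encoder features before the RNN, so each dataset can learn its own center bias.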
Bypass-RNN
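Earlier ways of extracting temporal features (RNNs, optical flow or 3D convolutions) are not applicable to static images. The authors propose the Bypass-RNN module, which uses a residual connection so that temporal features are extracted only when needed.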
RNN whose output is added to its input features via a residual connection that is automatically omitted (bypassed) for static batches during training and inference
The Bypass-RNN module is built with a convolutional GRU (cGRU) RNN.
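The bypass logic itself is simple and can be sketched as follows. Note that `ToyRecurrentCell` is a made-up stand-in (an exponential moving average) used only to keep the example short. It is not a cGRU, and the function names are mine.

```python
import numpy as np

class ToyRecurrentCell:
    """Tiny stand-in for a convolutional GRU: an exponential moving average over time."""

    def __init__(self, decay=0.5):
        self.decay = decay

    def __call__(self, x, h):
        return self.decay * h + (1.0 - self.decay) * x

def bypass_rnn(features, cell, static):
    """features has shape (T, C, H, W).

    For video, the recurrent output is added to the input (residual connection).
    For static images, the recurrent branch is skipped entirely.
    """
    if static:
        return features
    h = np.zeros_like(features[0])
    out = []
    for x in features:
        h = cell(x, h)
        out.append(x + h)
    return np.stack(out)

cell = ToyRecurrentCell()
image = np.ones((1, 2, 3, 3))  # a single static frame
video = np.ones((3, 2, 3, 3))  # a clip of three frames

image_out = bypass_rnn(image, cell, static=True)
video_out = bypass_rnn(video, cell, static=False)
```

Because the image path never touches the recurrent branch, the same network can process image batches and video batches with shared encoder and decoder weights.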
Decoder Network and Smoothing
The decoder details are as follows. To reduce the parameter count, it relies on depthwise separable convolutions and pointwise 1x1 convolutions.
The Post-US2 features are reduced to a single channel by a Domain-Adaptive Fusion layer (1x1 convolution) and upsampled to the input resolution via nearest-neighbor interpolation.
The upsampling is followed by a Domain-Adaptive Smoothing layer with 41x41 convolutional kernels that explicitly model the dataset-dependent blurring of the ground-truth saliency maps.
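To see why depthwise separable convolutions save parameters, here is a quick count. The channel sizes (256 in, 256 out, 3x3 kernel) are arbitrary numbers I picked for illustration, not the decoder's actual configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1x1 convolution (bias ignored)."""
    return c_in * k * k + c_in * c_out

standard = conv_params(256, 256, 3)                  # 589824 weights
separable = depthwise_separable_params(256, 256, 3)  # 67840 weights
ratio = standard / separable                         # roughly 8.7x fewer weights
```

For a 3x3 kernel the saving approaches 9x as the number of output channels grows, which is consistent with the lightweight design the paper emphasizes.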
Domain-Aware Optimization
The images of each dataset have a different aspect ratio. With Domain-Adaptive Batch Normalization, the network can take a dataset-specific input size.
- Isn't the MobileNet V2 input size fixed?
The images/frames of the training datasets each have different aspect ratios, specifically 4:3 for SALICON, 16:9 for DHF1K, and 1.85:1 (median) for Hollywood-2, and 3:2 (median) for UCF Sports.
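we use input resolutions of 288x384, 224x384, 224x416 and 256x384 for SALICON, DHF1K, Hollywood-2 and UCF Sports, respectively.
The video datasets have different frame rates, so every 5th or every 4th frame is taken, which gives the network a 6 Hz (6 fps) input.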
The frame rate of the DHF1K videos is 30 fps compared to 24 fps for Hollywood-2 and UCF Sports.
In order to assimilate the frame rates during training, and to train on longer time intervals, we construct clips using every 5th frame for DHF1K and every 4th frame for all others, yielding 6 fps overall.
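The frame subsampling can be written down in a few lines. The function name and the 300-frame video length are mine. The strides (5 for 30 fps, 4 for 24 fps) and the clip length of 12 frames come from the paper.

```python
STRIDE_BY_FPS = {30: 5, 24: 4}  # every 5th frame for DHF1K, every 4th for the others

def sample_clip(num_frames, fps, clip_len=12, start=0):
    """Return the frame indices of one training clip and the effective frame rate."""
    stride = STRIDE_BY_FPS[fps]
    indices = list(range(start, num_frames, stride))[:clip_len]
    return indices, fps / stride

idx30, rate30 = sample_clip(300, 30)  # DHF1K-style video
idx24, rate24 = sample_clip(300, 24)  # Hollywood-2 / UCF Sports-style video

# Both cases give 6.0 fps, so a 12-frame clip spans 2 seconds of video.
clip_seconds = len(idx30) / rate30
```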
Experiments
Experimental Setup
Dataset splits:
For SALICON, we use the official training/validation/testing split of 10,000/5,000/5,000.
For Hollywood-2 and UCF Sports, we use the training and testing splits of 823/884 and 103/47 videos, and the corresponding validation sets are randomly sampled 10% from the training sets, following.
For Hollywood-2, the videos are divided into individual shots.
For the DHF1K dataset, we use the official training/validation/testing splits of 600/100/300 videos.
Stochastic Gradient Descent with momentum of 0.9 and weight decay of 10^-4. Gradients are clipped to 2.
The learning rate is set to 0.04 and exponentially decayed by a factor of 0.8 after each epoch.
The batch size is set to 4 for video data and 32 for SALICON.
The video clip length is set to 12 frames that are sampled as described in Section 3.3.
To prevent overfitting, the weights of MNet V2 are frozen for the first two epochs and afterwards trained with a learning rate that is reduced by a factor of 10. The pretrained BN statistics of MNet V2 are frozen throughout training.
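The learning-rate schedule described above can be sketched as a small function. The function name is mine, and one detail is my assumption: I take the backbone rate after unfreezing to be one tenth of the current decayed rate, which the quoted text does not spell out.

```python
def learning_rates(epoch, base_lr=0.04, decay=0.8, freeze_epochs=2, backbone_factor=0.1):
    """Return (main_lr, backbone_lr) for a zero-based epoch index."""
    main_lr = base_lr * decay ** epoch  # multiplied by 0.8 after each epoch
    if epoch < freeze_epochs:
        backbone_lr = 0.0               # MNet V2 weights frozen
    else:
        backbone_lr = main_lr * backbone_factor  # assumption: relative to the decayed rate
    return main_lr, backbone_lr

schedule = [learning_rates(e) for e in range(4)]
```

In a framework such as PyTorch this would correspond to two parameter groups with different learning rates plus an exponential decay scheduler.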
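The target size is the original image size.
Dataset | Type | Image size | Frame rate (fps) | Network input | Network output | Target size |
---|---|---|---|---|---|---|
DHF1K | Video | 640x360 | 30 | 384x224 | 384x224 | 640x360 |
Hollywood-2 | Video | 640x304 | 24 | 416x224 | 416x224 | 640x304 |
UCF Sports | Video | 720x404 | 24 | 384x256 | 384x256 | 720x404 |
SALICON | Image | 640x480 | / | 384x288 | 384x288 | 640x480 |
Note: MobileNetV2, like many of Google's ImageNet-pretrained CNNs, is trained on 224x224 inputs. For the pretrained weights to transfer well to the new network, the image size has to match, in addition to preserving the original aspect ratio.
The paper does not explain in detail how the resolutions were chosen. My guess is that the smallest input/output dimension is set to 224, which for a 16:9 image gives a length of 384, and that the sizes for the other datasets are derived from these two numbers.
The video network takes 12 frames at a time, i.e. the past 2 s of footage (for 30 fps video, every 5th frame is taken, so 2 s = 12 frames; for 24 fps video, every 4th frame is taken, so 2 s = 12 frames).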
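My guess about the resolutions can at least be checked numerically. The snippet below compares the aspect ratio of each input resolution with the native ratio quoted from the paper, and tests divisibility by 32. The factor 32 is the overall stride of the standard MobileNetV2; that it influenced the authors' choice is my assumption.

```python
# (height, width) input resolutions and native aspect ratios quoted from the paper
INPUT_RES = {
    "SALICON": (288, 384),
    "DHF1K": (224, 384),
    "Hollywood-2": (224, 416),
    "UCF Sports": (256, 384),
}
NATIVE_RATIO = {
    "SALICON": 4 / 3,
    "DHF1K": 16 / 9,
    "Hollywood-2": 1.85,
    "UCF Sports": 3 / 2,
}

report = {}
for name, (h, w) in INPUT_RES.items():
    ratio = w / h
    rel_err = abs(ratio - NATIVE_RATIO[name]) / NATIVE_RATIO[name]
    divisible = (h % 32 == 0) and (w % 32 == 0)
    report[name] = (round(ratio, 3), round(rel_err, 4), divisible)
```

All four resolutions are multiples of 32 and stay within a few percent of the native aspect ratio. DHF1K deviates the most, since 384/224 is about 1.714 versus 16/9 at about 1.778.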
Quantitative Evaluation
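A large table lists the performance of different models on the video datasets, along with UNISAL results for different combinations of training datasets. The more training data is used, the better the predictions.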
Qualitative Evaluation
Ablation Study
The ablation compares performance as each module is added, showing that every added module contributes to the improvement.
Inter-Dataset Domain Shift
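The authors further analyze intermediate network features to support the effectiveness of the four domain-adaptive modules, i.e., DABN, Domain-Adaptive Fusion, Domain-Adaptive Priors and Domain-Adaptive Smoothing.
- (a) DABN: isn't a higher correlation coefficient worse?
- (b) I did not understand what the histogram means.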
Computational Load
Finally, the paper shows off its computational efficiency once more.
Original paper
https://arxiv.org/abs/2003.05477