一、Authors
Lv Tang, Bo Li, Yijie Zhong, Shouhong Ding, Mofei Song (Youtu Lab)
二、Links
2.1 Paper
ICCV link
2.2 Code
Code
三、Abstract
Aiming at discovering and locating the most distinctive objects in visual scenes, salient object detection (SOD) plays an essential role in various computer vision systems. Coming to the era of high resolution, SOD methods are facing new challenges. The major limitation of previous methods is that they try to identify the salient regions and estimate the accurate object boundaries simultaneously with a single regression task at low resolution. This practice ignores the inherent difference between the two difficult problems, resulting in poor detection quality. In this paper, we propose a novel deep learning framework for the high-resolution SOD task, which disentangles the task into a low-resolution saliency classification network (LRSCN) and a high-resolution refinement network (HRRN). As a pixel-wise classification task, LRSCN is designed to capture sufficient semantics at low resolution to identify the definite salient regions (most pixels inside the salient object have the highest saliency value), the definite background (most pixels in the background regions have the lowest saliency value), and the uncertain regions (saliency values of pixels at blurry object boundaries fluctuate between 0 and 1). HRRN is a regression task, which aims at accurately refining the saliency value of pixels in the uncertain region to preserve a clear object boundary at high resolution with limited GPU memory. It is worth noting that by introducing uncertainty into the training process, our HRRN can well address the high-resolution refinement task without using any high-resolution training data. Extensive experiments on high-resolution saliency datasets as well as some widely used saliency benchmarks show that the method achieves superior performance compared to state-of-the-art methods.
四、Main Content
4.1 Main Contributions
- We provide a new perspective that high-resolution salient object detection should be disentangled into two tasks, and demonstrate that the disentanglement of the two tasks is essential for improving the performance of DNN based SOD models.
- Motivated by the principle of disentanglement, we propose a novel framework for high-resolution salient object detection, which uses LRSCN to capture sufficient semantics at low-resolution and HRRN for accurate boundary refinement at high-resolution.
- We make the earliest effort to introduce uncertainty into SOD network training, which empowers HRRN to well address the high-resolution refinement task without any high-resolution training datasets.
- We perform extensive experiments to demonstrate that the proposed method refreshes the SOTA performance on high-resolution saliency datasets as well as some widely used saliency benchmarks by a large margin.
4.2 Network Architecture (VGG-16 backbone)
4.3 MECF and SGA
4.3.1 ME (based on the Global Convolutional Network (GCN))
4.3.2 CF (cross-level feature fusion module)
4.3.3 SGA (guarantees the alignment of the trimap and the saliency map)
4.4 Loss Functions
4.4.1 LRSCN Loss
$T^{gt}$: trimap ground truth

$$T^{gt}(x, y) = \begin{cases} 2, & (x, y) \in \text{definite salient} \\ 0, & (x, y) \in \text{definite background} \\ 1, & (x, y) \in \text{uncertain region} \end{cases}$$

$$L_{trimap} = \frac{1}{N}\sum_i -\log\left(\frac{e^{T_i}}{\sum_j e^{T_j}}\right)$$

$$L_{LRSCN} = L_{saliency} + L_{trimap}$$

where $L_{saliency}$ combines BCE, SSIM, and F-measure losses.
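The trimap loss above is a standard three-class cross-entropy over per-pixel trimap labels. A minimal NumPy sketch (an illustrative reimplementation, not the authors' code; pixels are assumed flattened into an `(N, 3)` logit array with class ordering 0 = background, 1 = uncertain, 2 = salient, following the trimap definition above):

```python
import numpy as np

def trimap_cross_entropy(logits, trimap_gt):
    """Pixel-wise 3-class cross-entropy over trimap labels.

    logits:    (N, 3) per-pixel class scores
               (0 = background, 1 = uncertain, 2 = salient).
    trimap_gt: (N,) integer labels in {0, 1, 2}.
    """
    # numerically stabilized softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # negative log-likelihood of the ground-truth class, averaged over pixels
    return -np.mean(np.log(probs[np.arange(len(trimap_gt)), trimap_gt]))
```

In a real training pipeline this corresponds to applying softmax cross-entropy to the trimap branch's output map, while the saliency branch is supervised separately by $L_{saliency}$.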
4.4.2 HRRN Loss
The uncertainty loss shrinks the weight of the loss in the uncertain region, letting the network ignore the effects of noisy data as much as possible.
$$L_1 = \frac{1}{E}\sum_{i \in E} |S_i^H - G_i^H|$$

where $E$ is the number of pixels.

$$L_{uncertainty} = \frac{1}{U}\sum_{i \in U}\left(\frac{\|S_i^H - G_i^H\|^2}{2\sigma_i^2} + \frac{1}{2}\log\sigma_i^2\right)$$

where $U$ is the total number of pixels in the uncertain region.

$$L_{HRRN} = L_{uncertainty} + L_1$$
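The HRRN objective above can be sketched in NumPy as follows (an illustrative reimplementation under assumed interfaces: the network is taken to predict $\log\sigma_i^2$ per pixel, a common parameterization for numerical stability, and `uncertain_mask` marks the trimap's uncertain region):

```python
import numpy as np

def hrrn_loss(pred, gt, log_var, uncertain_mask):
    """L_HRRN = L_uncertainty + L_1 (sketch).

    pred, gt:       (N,) predicted / ground-truth saliency values.
    log_var:        (N,) predicted log(sigma_i^2) per pixel.
    uncertain_mask: (N,) boolean, True for pixels in the uncertain region.
    """
    # L_1 over all E pixels
    l1 = np.mean(np.abs(pred - gt))
    # Uncertainty-weighted squared error over the U uncertain pixels:
    # a large sigma^2 down-weights the residual, while the 0.5*log(sigma^2)
    # term penalizes predicting high uncertainty everywhere.
    diff2 = (pred - gt) ** 2
    lu = np.mean(diff2[uncertain_mask] / (2.0 * np.exp(log_var[uncertain_mask]))
                 + 0.5 * log_var[uncertain_mask])
    return lu + l1
```

The balance between the two terms is what lets noisy boundary labels be discounted: pixels where the residual stays large push $\sigma_i^2$ up, which in turn shrinks their contribution to the loss.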
五、Evaluation Metrics
- MAE
- F-measure ($F_\beta$ and $F_\beta^{max}$)
- Structure Measure
- PR curves
- BDE (Boundary Displacement Error)
- $B_\mu$
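For reference, the two most common metrics in the list can be sketched as follows (an illustrative NumPy version, not the official evaluation code; the F-measure uses a single fixed threshold here, whereas $F_\beta^{max}$ sweeps thresholds, and $\beta^2 = 0.3$ follows the usual SOD convention):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and ground truth, both in [0, 1]."""
    return np.mean(np.abs(pred - gt))

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F_beta at one binarization threshold, with beta^2 = 0.3 as is conventional in SOD."""
    binary = pred >= threshold
    positives = gt > 0.5
    tp = np.sum(binary & positives)
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(positives.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

BDE, the Structure Measure, and $B_\mu$ additionally account for boundary displacement and structural similarity, which pixel-wise MAE cannot capture.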
六、Conclusion
In this paper, we argue that there are two difficult and inherently different problems in high-resolution SOD. From this perspective, we propose a novel deep learning framework that disentangles high-resolution SOD into two tasks: LRSCN and HRRN. LRSCN identifies the definite salient, definite background, and uncertain regions at low resolution with sufficient semantics, while HRRN accurately refines the saliency value of pixels in the uncertain region to preserve a clear object boundary at high resolution with limited GPU memory. We also make the earliest effort to introduce uncertainty into SOD network training, which empowers HRRN to learn rich details without using any high-resolution training datasets. Extensive evaluations on high-resolution datasets and popular benchmark datasets not only verify the superiority of our method but also demonstrate the importance of disentanglement for SOD. We believe the novel disentanglement view in this work can contribute to other high-resolution computer vision tasks in the future.