# 一、Authors
Tao Zhou, Huazhu Fu, Geng Chen, Yi Zhou, Deng-Ping Fan, Ling Shao
# 二、Links
## 2.1 Paper
ICCV link
## 2.2 Code
Code
# 三、Abstract
RGB-D saliency detection has attracted increasing attention, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing works often focus on learning a shared representation through various fusion strategies, with few methods explicitly considering how to preserve modality-specific characteristics. In this paper, taking a new perspective, we propose a specificity preserving network (SP-Net) for RGB-D saliency detection, which benefits saliency detection performance by exploring both the shared information and modality-specific properties (e.g., specificity). Specifically, two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. A cross enhanced integration module (CIM) is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer for integrating cross-level information. Besides, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder, which can provide rich complementary multi-modal information to boost the saliency detection performance. Further, a skip connection is used to combine hierarchical features between the encoder and decoder layers. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
# 四、Main Content
## 4.1 Main Contributions
• We propose a novel specificity-preserving network for RGB-D saliency detection (SP-Net), which can explore the shared information as well as preserve modality-specific characteristics.
• We propose a cross-enhanced integration module (CIM) to fuse the cross-modal features and learn shared representations for the two modalities. The output of each CIM is then propagated to the next layer to capture cross-level information.
• We propose a simple but effective multi-modal feature aggregation (MFA) module to integrate these learned modality-specific features. It makes full use of the features learned in the modality-specific decoder to boost the saliency detection performance.
• Extensive experiments on six public datasets demonstrate the superiority of our model over thirty benchmarking methods. Moreover, we carry out an attribute-based evaluation to study the performance of many state-of-the-art RGB-D saliency detection methods under different challenging factors (e.g., number of salient objects, indoor or outdoor environments, and light conditions), which has not been explored in previous studies.
## 4.2 Network Architecture
### 4.2.1 Overall Architecture (Res2Net-50)
The overall architecture of the proposed SP-Net. Our model consists of two modality-specific learning networks and a shared learning network. The modality-specific learning networks are used to preserve the individual properties for each modality (i.e., RGB or depth), while the shared network is used to fuse cross-modal features and explore their complementary information. Skip connections are adopted to combine hierarchical features between the encoder and decoder layers. The learned features from the modality-specific decoder are integrated into the shared decoder to provide rich multi-modal complementary information for boosting saliency detection performance. Here, “C” denotes feature concatenation.
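To make the three-stream data flow concrete, here is a minimal PyTorch sketch of the layout described above. The single-stage encoders, 1×1 decoder heads, and module names (`ConvBlock`, `SPNetSketch`) are illustrative assumptions standing in for the actual Res2Net-50 backbone and multi-level decoders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Simple conv-BN-ReLU block standing in for a real backbone stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class SPNetSketch(nn.Module):
    """Hypothetical three-stream layout: two modality-specific streams
    plus a shared stream fused by CIM/MFA-like modules (names assumed)."""
    def __init__(self, ch=64):
        super().__init__()
        self.rgb_enc = ConvBlock(3, ch)     # RGB-specific encoder stage
        self.dep_enc = ConvBlock(1, ch)     # depth-specific encoder stage
        self.cim = ConvBlock(2 * ch, ch)    # stands in for a CIM fusion unit
        self.rgb_dec = nn.Conv2d(ch, 1, 1)  # modality-specific decoder heads
        self.dep_dec = nn.Conv2d(ch, 1, 1)
        self.mfa = ConvBlock(3 * ch, ch)    # stands in for MFA aggregation
        self.sh_dec = nn.Conv2d(ch, 1, 1)   # shared decoder head

    def forward(self, rgb, depth):
        fr = self.rgb_enc(rgb)
        fd = self.dep_enc(depth)
        f_sh = self.cim(torch.cat([fr, fd], dim=1))        # shared features
        s_r, s_d = self.rgb_dec(fr), self.dep_dec(fd)      # S_R, S_D
        fused = self.mfa(torch.cat([f_sh, fr, fd], dim=1)) # inject specifics
        s_sh = self.sh_dec(fused)                          # S_sh
        return s_sh, s_r, s_d
```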
### 4.2.2 CIM (to learn the shared feature representation)
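The paper fuses cross-modal features at each level and propagates each CIM's output to the next layer for cross-level integration. Below is a hedged sketch of one plausible cross-enhancement design (mutual sigmoid gating with residual fusion); the authors' exact CIM internals may differ.

```python
import torch
import torch.nn as nn

class CIMSketch(nn.Module):
    """A plausible cross-enhanced integration unit (design assumed):
    each modality is modulated by the other before fusion."""
    def __init__(self, ch):
        super().__init__()
        self.conv_r = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv_d = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, f_rgb, f_dep, f_prev=None):
        # Cross-enhancement: gate each stream by the other modality.
        r = self.conv_r(f_rgb) * torch.sigmoid(f_dep) + f_rgb
        d = self.conv_d(f_dep) * torch.sigmoid(f_rgb) + f_dep
        f_sh = self.fuse(torch.cat([r, d], dim=1))
        # Propagate to the next layer to capture cross-level information.
        return f_sh if f_prev is None else f_sh + f_prev
```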
### 4.2.3 MFA (to integrate features into the shared decoder)
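A hedged sketch of how the modality-specific decoder features could be folded into the shared decoder. The concatenate-reduce-residual design here is an assumption for illustration, not the paper's exact MFA.

```python
import torch
import torch.nn as nn

class MFASketch(nn.Module):
    """Aggregates modality-specific decoder features into the shared
    decoder feature (aggregation scheme assumed)."""
    def __init__(self, ch):
        super().__init__()
        self.reduce = nn.Conv2d(3 * ch, ch, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_sh, f_rgb_dec, f_dep_dec):
        agg = self.reduce(torch.cat([f_sh, f_rgb_dec, f_dep_dec], dim=1))
        return self.refine(agg) + f_sh  # residual keeps the shared path intact
```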
## 4.3 Loss
$$L_{total} = L_{sh}(S_{sh}, G) + L_{sp}(S_R, G) + L_{sp}(S_D, G)$$

where $L_{sp}$ is the modality-specific loss and $L_{sh}$ is the shared-decoder loss; $S_R$ and $S_D$ are the prediction maps obtained from the RGB and depth inputs, $S_{sh}$ is the prediction map obtained from their shared representation, and $G$ is the ground-truth map.
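The total loss is straightforward to implement. The sketch below assumes binary cross-entropy for both $L_{sh}$ and $L_{sp}$, a common choice in saliency detection; the paper's exact loss terms may differ.

```python
import torch
import torch.nn.functional as F

def total_loss(s_sh, s_r, s_d, gt):
    """L_total = L_sh(S_sh, G) + L_sp(S_R, G) + L_sp(S_D, G).
    BCE is assumed here for both the shared and specific terms."""
    l_sh = F.binary_cross_entropy_with_logits(s_sh, gt)    # shared decoder
    l_sp_r = F.binary_cross_entropy_with_logits(s_r, gt)   # RGB-specific
    l_sp_d = F.binary_cross_entropy_with_logits(s_d, gt)   # depth-specific
    return l_sh + l_sp_r + l_sp_d
```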
# 五、Evaluation Metrics
S-measure ($S_\alpha$), E-measure ($E_\phi$), F-measure ($F_\beta$), and MAE
$$S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r$$

where $S_o$ denotes the object-aware structural similarity and $S_r$ the region-aware structural similarity.

$$E_\phi = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \phi_{FM}(i, j)$$

where $W$ and $H$ are the width and height of the saliency map and $\phi_{FM}$ is the enhanced-alignment matrix.

$$F_\beta = \frac{(1 + \beta^2)\, P \cdot R}{\beta^2 P + R}$$

where $P$ and $R$ denote precision and recall.
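MAE and the F-measure follow directly from their definitions. The snippet below uses the conventional $\beta^2 = 0.3$ and a single fixed threshold for simplicity (benchmarks typically sweep thresholds); S-measure and E-measure need the full region/object decomposition and alignment matrix, so they are omitted here.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and the ground truth,
    both assumed to be float arrays in [0, 1]."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    A single threshold is used here for illustration."""
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```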
# 六、Conclusion
In this paper, we have proposed a novel SP-Net for RGB-D saliency detection. Different from most existing works, which mainly focus on learning shared representations, our model not only explores the shared cross-modal information but also preserves modality-specific characteristics to boost the saliency detection performance. Besides, the proposed CIM can propagate information across modalities and layers, while our MFA module can provide specific properties to the shared decoder to enhance the complementary multi-modal information. Quantitative and qualitative evaluations conducted on six challenging benchmark datasets demonstrate the superiority of our SP-Net over other existing RGB-D saliency detection approaches. In the future, we plan to apply our model to the light field saliency detection task.