Salient Object Detection and Segmentation

Ming-Ming Cheng1,4  Niloy J. Mitra2  Xiaolei Huang3  Philip H. S. Torr4  Shi-Min Hu1

1TNList, Tsinghua University      2UCL/KAUST     3Lehigh University     4Oxford Brookes University

Figure. Given input images (top), a global contrast analysis is used to compute high resolution saliency maps (middle), which can be used to produce masks (bottom) around regions of interest.

Abstract

Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object extraction algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut for high quality salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.
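To make the phrase "global contrast differences" concrete, here is a toy sketch in Python. It is not the authors' implementation: it works on a flat list of gray levels rather than color regions, it ignores the spatial weighting mentioned above, and the function name `global_contrast_saliency` is made up for illustration. Each distinct value is scored by how much it differs from all other values in the image, weighted by how often those values occur.

```python
from collections import Counter


def global_contrast_saliency(pixels):
    """Toy global-contrast saliency for a flat list of gray levels (0-255).

    Assumption: a simplified, gray-level illustration only. The score of a
    value is the frequency-weighted sum of its absolute differences to every
    value present in the image, rescaled so the maximum score is 255.
    """
    if not pixels:
        return []
    hist = Counter(pixels)
    total = len(pixels)
    raw = {
        v: sum(count / total * abs(v - u) for u, count in hist.items())
        for v in hist
    }
    top = max(raw.values())
    if top == 0:  # uniform image: nothing stands out
        return [0] * len(pixels)
    return [round(255 * raw[p] / top) for p in pixels]


if __name__ == "__main__":
    # Nine dark pixels and one bright pixel: the rare bright value scores highest.
    print(global_contrast_saliency([10] * 9 + [200]))
```

Rare values that differ strongly from the dominant ones receive high scores, which is the intuition behind contrast-based saliency; region-level processing and spatial weighting are deliberately left out of this sketch.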

Papers

  • Global Contrast based Salient Region Detection. Ming-Ming Cheng, Guo-Xin Zhang, Niloy J. Mitra, Xiaolei Huang, Shi-Min Hu. IEEE CVPR, 2011, p. 409-416.  [Project page] [C++] [Bib]
  • Salient Object Detection and Segmentation. Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, Shi-Min Hu. Submitted to IEEE TPAMI (TPAMI-2011-10-0753), 2011. [Paper] [Poster]

Comparisons with state of the art methods

Figure. Statistical comparison results of (a) different saliency region detection methods, (b) their variants, and (c) object of interest region segmentation methods, using (i) the largest publicly available dataset and (ii) our THUS10000 dataset (to be made publicly available). We compare our HC and RC methods with 15 state-of-the-art methods, including FT [1], AIM [2], MSS [3], SEG [4], SeR [5], SUN [6], SWD [7], IM [8], IT [9], GB [10], SR [11], CA [12], LC [13], AC [14], and CB [15]. We also take a simple variable-size Gaussian model 'Gau' and the GrabCut method as baselines. (Please see our paper for detailed explanations.)

Figure. Comparison of average Fβ for different saliency segmentation methods: FT [1], SEG [4], and ours, on the THUR15000 dataset, which is composed of non-selected Internet images.

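Method    | IT     | AIM    | IM     | MSS    | SEG    | SeR    | SUN    | SWD    | CB
Time (s)  | 0.611  | 4.288  | 0.991  | 0.106  | 4.921  | 1.019  | 1.116  | 0.100  | 5.568
Code Type | Matlab | Matlab | Matlab | Matlab | Matlab | Matlab | Matlab | Matlab | Matlab & C++

Method    | GB     | SR     | FT     | AC     | CA     | LC     | HC     | RC
Time (s)  | 1.614  | 0.064  | 0.102  | 0.109  | 53.1   | 0.018  | 0.019  | 0.254
Code Type | Matlab | Matlab | C++    | Matlab | Matlab | C++    | C++    | C++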

Table. Average time taken to compute a saliency map for images in the THUS10000 database. (Note that we use the authors' original implementations for MSS and FT, which are not well optimized.)

 

Method    | FT     | SEG          | CB           | Ours
Time (s)  | 0.247  | 7.48         | 36.5         | 0.621
Code Type | Matlab | Matlab & C++ | Matlab & C++ | C++

Table. Comparison of average time for different saliency segmentation methods.

Figure. Saliency maps computed by different state-of-the-art methods (b-p), and with our proposed HC (q) and RC (r) methods. Most results highlight edges, or are of low resolution. See also the shared data for saliency detection results for the whole THUS10000 dataset.

Figure. Sketch based image comparison. In each group, from left to right: the first column shows images downloaded from Flickr using the corresponding keyword; the second column shows our retrieval results, obtained by comparing the user's input sketch with the SaliencyCut result using the shape context measure [41]; the third column shows the corresponding sketch based retrieval results using SHoG [42].

Downloads

Some files are in zip format and password protected. They will be made available after the corresponding paper is accepted. Red links are available now!

1. Data

    The THUS10000 benchmark dataset comprises 10,000 images (181 MB), each of which has an unambiguous salient object whose region is accurately annotated with pixel-wise ground-truth labeling (13.1 MB). We provide saliency maps (5.3 GB, containing 170,000 images) for our methods as well as 15 other state-of-the-art methods, including FT [1], AIM [2], MSS [3], SEG [4], SeR [5], SUN [6], SWD [7], IM [8], IT [9], GB [10], SR [11], CA [12], LC [13], AC [14], and CB [15]. Saliency segmentation results (71.3 MB) for FT [1], SEG [4], and CB [15] are also available.

2. Windows executable

    We supply a Windows msi installer for our prototype software, which includes our implementations of FT [1], SR [11], LC [13], and our HC, RC and SaliencyCut methods.

3. C++ source code

    The C++ implementation of our paper as well as several other state of the art works.

4. Supplemental material

    Supplemental materials (647 MB), including comparisons with 15 other state-of-the-art algorithms, are now available.

FAQs

So far, more than 1000 readers (according to email records) have requested the source code for this project. Some of them have questions about using the code. Here are some frequently asked questions for new users to refer to:

Q1: I’m confused by this sentence in the paper: “In our experiments, the threshold is chosen empirically to be the threshold that gives 95% recall rate in our fixed thresholding experiments”. In most cases, people do not have the ground truth, so they cannot compute the recall rate. When I use your Cut application, I need to guess the threshold value to get a good cut image.

A: The recall rate is only used to evaluate the algorithm. When you use it, you typically don't need to evaluate the algorithm itself very often. The sentence explains what the fixed threshold we use typically means. In practice, when initialized using RC saliency maps, this threshold is 70 with saliency values normalized to [0,255]. This doesn’t mean that the threshold corresponds to a recall rate of 95% for every image, but that it empirically corresponds to a recall rate of 95% over a large number of images. So simply using the suggested threshold of 70 is OK.
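As a rough illustration of how such a fixed threshold could be derived when ground truth is available, the Python sketch below scans thresholds from high to low and returns the largest one whose mean recall over a dataset reaches the target. This is one reading of the sentence quoted in Q1, not the released code; the helper names `binarize`, `recall` and `threshold_for_recall`, and the use of `>=` for the comparison, are assumptions made for this example.

```python
def binarize(saliency, threshold):
    """Turn a saliency map (rows of ints in [0, 255]) into a boolean mask.

    Assumption: pixels with value >= threshold count as salient.
    """
    return [[v >= threshold for v in row] for row in saliency]


def recall(mask, truth):
    """Fraction of ground-truth object pixels that the mask covers."""
    hits = sum(1 for mr, tr in zip(mask, truth)
               for m, t in zip(mr, tr) if m and t)
    positives = sum(1 for tr in truth for t in tr if t)
    return hits / positives if positives else 1.0


def threshold_for_recall(dataset, target=0.95):
    """Largest threshold in 0..255 whose mean recall over the dataset >= target.

    `dataset` is a list of (saliency_map, ground_truth_mask) pairs.
    """
    if not dataset:
        raise ValueError("dataset must not be empty")
    for threshold in range(255, -1, -1):
        mean = sum(recall(binarize(s, threshold), t)
                   for s, t in dataset) / len(dataset)
        if mean >= target:
            return threshold
    return 0  # not reached: threshold 0 marks every pixel, so recall is 1


if __name__ == "__main__":
    data = [([[200, 80], [60, 10]], [[1, 1], [0, 0]])]
    print(threshold_for_recall(data))
```

Without ground truth, only `binarize` is needed: calling `binarize(saliency, 70)` applies the suggested value directly.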

Q2: I used your code to get results for the same database you used, but the results differ slightly from yours.

A: It seems that the cvtColor function in OpenCV 1.x differs from the one in OpenCV 2.x. I suggest using a recent version. The segmentation method I used occasionally generates strange results, which lead to strange saliency maps. This happens rarely. When it happens, I rerun the exe and it becomes OK. I don't know why, but it does happen when I use the exe for the first time after compiling (very strange, maybe because of some default initializations). If someone finds the bug, please report it to me.

Q3: Does your algorithm only get good results for images with a single salient object?

A: Mostly yes. As described in our paper, our method is suitable for images with an unambiguous salient object. Saliency detection methods typically have no prior knowledge about the target object, which makes the problem very difficult. Much recent research focuses on images with a single salient object. Even for this simple case, state-of-the-art algorithms may fail. This is understandable, since supervised object detection, which uses a large amount of training data and prior knowledge, also fails in many cases.

However, the value of saliency detection methods lies in their applications in many fields. Because they don't need extensive human annotation for learning, and are typically much faster than object detection methods, it’s possible to automatically process a large number of images at low cost. Although many of the saliency detection results may be wrong (up to 60% for noisy Internet images) because salient objects are ambiguous or even missing, we can still use efficient algorithms to select the good results and use them in many interesting applications, such as the following (note: all of the following projects use our saliency source code, with an initial version of SaliencyCut used in our own Sketch2Photo project):

  1. Image retrieval: Sketch2Photo: Internet Image Montage. Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, Shi-Min Hu. ACM SIGGRAPH Asia. 28, 5, 124:1-10, 2009.
  2. Image editing: Semantic Colorization with Internet Images, Yong Sang Chia, Shaojie Zhuo, Raj Kumar Gupta, Yu-Wing Tai, Siu-Yeung Cho, Ping Tan, Stephen Lin, ACM SIGGRAPH Asia. 2011.
  3. View selection: Web-Image Driven Best Views of 3D Shapes. H Liu, L Zhang, H Huang. The Visual Computer, 2011. Accepted.
  4. Image Collage: Arcimboldo-like Collage Using Internet Images. H Huang, L Zhang, HC Zhang. ACM SIGGRAPH Asia, 30(6), 2011.
  5. Image manipulation: Data-Driven Object Manipulation in Images. Chen Goldberg, T Chen, FL Zhang, A Shamir, SM Hu. Eurographics 2012.
  6. Saliency For Image Manipulation, R. Margolin, L. Zelnik-Manor, and A. Tal, Computer Graphics International (CGI) 2012.
  7. Mobile Product Search with Bag of Hash Bits and Boundary Reranking,  Junfeng He, Xianglong Liu, Tao Cheng, Jinyuan Feng, Tai-Hsu Lin, Hyunjin Chung and Shih-Fu Chang, IEEE CVPR, 2012
  8. SalientShape: Group Saliency in Image Collections. Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Shi-Min Hu. The Visual Computer, 2013
  9. Much more: http://scholar.google.com/scholar?cites=9026003219213417480

Q4: I'm confused about the definition of saliency. Why are the annotation formats (isolated points, binary mask regions, and bounding boxes) in different benchmarks for evaluating saliency detection methods so different?

A: There are three different saliency detection directions: i) fixation prediction, ii) salient object detection, and iii) objectness estimation. They have very different research targets and very different applications. Personally, I’m mainly interested in the last two problems and will discuss them in a bit more detail.

Eye fixation models aim at predicting where humans look, i.e. a small set of fixation points. The most famous method in this area is Itti’s work in PAMI 1998. The MIT benchmark is designed for evaluating such methods.

Salient object detection, which is what this work does, aims at finding the most salient object in a scene and segmenting the whole extent of that object. The output is typically a single saliency map (or figure-ground segmentation). The advantages and disadvantages are described in detail in Q3. High precision is a major focus of our work, as we can use shape matching based techniques to effectively select good segmentations and build robust applications on top. The most widely used benchmark for evaluating this problem is MSRA1000, which provides precise segmentations of 1000 salient objects in MSRA images. Our method achieves 93% precision and 90% recall on MSRA1000 (previous best reported results: 75% precision and 83% recall). Since our results on MSRA1000 are mostly comparable to the ground truth annotations, we need more challenging benchmarks. THUS10000 and THUR15000 are built for this purpose.
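For readers who want to reproduce precision, recall and Fβ numbers of this kind on their own masks, a minimal Python sketch follows. It is an illustration under stated assumptions, not the evaluation code used for the figures: the function name `precision_recall_f` is hypothetical, and the default `beta_sq=0.3` is a value commonly used in the salient object detection literature to weight precision more than recall, so check the paper for the exact setting.

```python
def precision_recall_f(mask, truth, beta_sq=0.3):
    """Pixel-wise precision, recall and F-beta of a binary mask.

    `mask` and `truth` are same-sized 2D lists of truthy/falsy values.
    Assumption: beta_sq (beta squared) defaults to 0.3; this value is not
    stated on this page.
    """
    tp = fp = fn = 0
    for mask_row, truth_row in zip(mask, truth):
        for m, t in zip(mask_row, truth_row):
            if m and t:
                tp += 1      # salient in both
            elif m:
                fp += 1      # predicted salient, actually background
            elif t:
                fn += 1      # missed object pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta_sq * precision + recall
    f_beta = (1 + beta_sq) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    annotated = [[1, 0], [1, 0]]
    print(precision_recall_f(predicted, annotated))
```

Dataset-level numbers are usually obtained by averaging these per-image values over all images.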

Objectness estimation is another attractive direction. These methods aim at proposing a small set (typically 1000) of bounding boxes to improve the efficiency of the classical sliding window pipeline. High recall with a small set of bounding box proposals is the major target. PASCAL VOC is a standard dataset for evaluating this problem. Using purely bottom-up, data-driven methods to produce a single saliency map, as most salient object detection models do, is less likely to succeed on this very challenging dataset. State-of-the-art objectness proposal methods (PAMI12, IJCV13) achieve 90+% recall on the challenging PASCAL VOC dataset given a relatively small number (e.g. 1000) of bounding boxes, while being computationally efficient (4 seconds per image). This is especially useful for speeding up multi-class object detection, as each classifier only needs to examine a much smaller number of image windows (e.g. 1,000,000 -> 1,000).
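The recall of a set of bounding box proposals can be sketched in Python as follows. This is a simplified illustration, not code from any of the cited methods: the helper names `iou` and `proposal_recall` are made up, boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`, and a ground-truth box is counted as found when some proposal overlaps it with intersection-over-union of at least 0.5, the usual PASCAL VOC criterion.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def proposal_recall(proposals, ground_truth, min_iou=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal.

    Assumption: a match means IoU >= min_iou (0.5 by default).
    """
    if not ground_truth:
        return 1.0
    found = sum(
        1 for g in ground_truth
        if any(iou(p, g) >= min_iou for p in proposals)
    )
    return found / len(ground_truth)


if __name__ == "__main__":
    objects = [(0, 0, 10, 10), (20, 20, 30, 30)]
    boxes = [(0, 0, 10, 8), (50, 50, 60, 60)]
    print(proposal_recall(boxes, objects))
```

Plotting this recall against the number of proposals kept per image is a common way to compare objectness methods.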

Links to source code of other methods

FT[1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in IEEE CVPR, 2009, pp. 1597–1604.
AIM[2] N. Bruce and J. Tsotsos, “Saliency, attention, and visual search: An information theoretic approach,” Journal of Vision, vol. 9, no. 3, pp. 5:1–24, 2009.
MSS[3] R. Achanta and S. Süsstrunk, “Saliency detection using maximum symmetric surround,” in IEEE ICIP, 2010, pp. 2653–2656.
SEG[4] E. Rahtu, J. Kannala, M. Salo, and J. Heikkila, “Segmenting salient objects from images and videos,” ECCV, pp. 366–379, 2010.
SeR[5] H. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of vision, vol. 9, no. 12, pp. 15:1–27, 2009.
SUN[6] L. Zhang, M. Tong, T. Marks, H. Shan, and G. Cottrell, “SUN: A bayesian framework for saliency using natural statistics,” Journal of Vision, vol. 8, no. 7, pp. 32:1–20, 2008.
SWD[7] L. Duan, C. Wu, J. Miao, L. Qing, and Y. Fu, “Visual saliency detection by spatially weighted dissimilarity,” in IEEE CVPR, 2011, pp. 473–480.
IM[8] N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga, “Saliency estimation using a non-parametric low-level vision model,” in IEEE CVPR, 2011, pp. 433–440.
IT[9] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, vol. 20, no. 11, pp. 1254–1259, 1998.
GB[10] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in NIPS, 2007, pp. 545–552.
SR[11] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in IEEE CVPR, 2007, pp. 1–8.
CA[12] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in IEEE CVPR, 2010, pp. 2376–2383.
LC[13] Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM Multimedia, 2006, pp. 815–824.
AC[14] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, “Salient region detection and segmentation,” in IEEE ICVS, 2008, pp. 66–75.
CB[15] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in British Machine Vision Conference, 2011, pp. 1–12.
LP[16] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in IEEE ICCV, 2009.