Richer Edge (Nankai University): Richer Convolutional Features for Edge Detection

Title: Richer Convolutional Features for Edge Detection

Authors: Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Jia-Wang Bian, Le Zhang, Xiang Bai, and Jinhui Tang

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Useful phrases:

"Due to its far-reaching applications in xxx" (a way of saying that something is widely used in the xxx field)

Abstract:

Edge detection is a fundamental problem in computer vision.
Recently, convolutional neural networks (CNNs) have pushed forward this field significantly. 
Existing methods which adopt specific layers of deep CNNs may fail to capture complex data structures caused by variations of scales and aspect ratios. 
In this paper, we propose an accurate edge detector using richer convolutional features (RCF). 
RCF encapsulates all convolutional features into a more discriminative representation, which makes good use of rich feature hierarchies and is amenable to training via backpropagation.
RCF fully exploits multiscale and multilevel information of objects to perform the image-to-image prediction holistically.
Using the VGG16 network, we achieve state-of-the-art performance on several available datasets.
When evaluated on the well-known BSDS500 benchmark, we achieve an ODS F-measure of 0.811 while retaining a fast speed (8 FPS).
Besides, our fast version of RCF achieves an ODS F-measure of 0.806 at 30 FPS.
We also demonstrate the versatility of the proposed method by applying RCF edges for classical image segmentation.

1 INTRODUCTION

EDGE detection can be viewed as a method to extract visually salient edges and object boundaries from natural images. 
Due to its far-reaching applications in many high-level tasks, including object detection [2], [3], object proposal generation [4], [5], and image segmentation [6], [7], edge detection is a core low-level problem in computer vision.

The fundamental scientific question here is what representation is rich enough for a predictor to distinguish edges/boundaries from the image data.
To answer this, traditional methods first extract local cues of brightness, color, gradient, and texture,
or other manually designed features like Pb [8] and gPb [9]; sophisticated learning paradigms [10] are then used to classify edge and non-edge pixels.
Although edge detectors based on low-level features are somewhat promising, their limitations are obvious as well.
For example, edges and boundaries are often defined to be semantically meaningful, yet it is difficult to represent high-level information with low-level cues.
Recently, convolutional neural networks (CNNs) have become popular in computer vision [11], [12].
Since CNNs have a strong capability to automatically learn the high-level representations for natural images, there is a recent trend of using CNNs to perform edge detection.
Some well-known CNN-based methods have pushed forward this field significantly, such as DeepEdge [13], N4-Fields [14], DeepContour [15], and HED [16].
Our algorithm falls into this category as well.

As illustrated in Fig. 1, we build a simple network to produce side outputs of intermediate layers using VGG16 [11] with the HED architecture [16].
We can see that the information obtained by different convolution (i.e., conv) layers gradually becomes coarser.
More importantly, intermediate conv layers contain essential fine details.
However, previous CNN architectures only use the final conv layer or the layers before the pooling layers of neural networks, but ignore the intermediate layers.
On the other hand, since richer convolutional features are highly effective for many vision tasks, many researchers make efforts to develop deeper networks [17].
However, it is difficult to get the networks to converge when going deeper because of vanishing/exploding gradients and training data shortage (e.g. for edge detection).
So why don’t we make full use of the CNN features we have now?
Based on these observations, we propose richer convolutional features (RCF), a novel deep structure fully exploiting the CNN features from all the conv layers, to perform the pixel-wise prediction for edge detection in an image-to-image fashion.
RCF can automatically learn to combine complementary information from all layers of CNNs and thus can obtain accurate representations for objects or object parts in different scales.
The evaluation results demonstrate RCF performs very well on edge detection.

Note: HED stands for holistically-nested edge detection: S. Xie and Z. Tu, “Holistically-nested edge detection,” Int. J. Comput. Vis., vol. 125, no. 1-3, pp. 3–18, 2017.

After the publication of the conference version [1], our proposed RCF edges have been widely used in weakly supervised semantic segmentation [18], style transfer [19], and stereo matching [20].
Besides, the idea of utilizing all the conv layers in a unified framework can be potentially generalized to other vision tasks. This has been demonstrated in skeleton detection [21], medial axis detection [22], people detection [23], and surface fatigue crack identification [24].

When evaluating our method on the BSDS500 dataset [9] for edge detection, we achieve a good trade-off between effectiveness and efficiency, with an ODS F-measure of 0.811 at 8 FPS.
It even outperforms human perception (ODS F-measure 0.803).
In addition, a fast version of RCF is also presented, which achieves an ODS F-measure of 0.806 at 30 FPS.
When applying our RCF edges to classic image segmentation, we can obtain high-quality perceptual regions as well.

2 RELATED WORK

As one of the most fundamental problems in computer vision, edge detection has been extensively studied for several decades.
Early pioneering methods mainly focus on the utilization of intensity and color gradients, such as Canny [25].
However, these early methods are usually not accurate enough for real-life applications.
To this end, feature learning based methods have been proposed.
These methods, such as Pb [8], gPb [9], and SE [10], usually employ sophisticated learning paradigms to predict edge strength with low-level features such as intensity, gradient, and texture.
Although these methods are shown to be promising in some cases, these handcrafted features have limited ability to represent high-level information for semantically meaningful edge detection.

Deep learning based algorithms have made vast inroads into many computer vision tasks.
Under this umbrella, many deep edge detectors have been introduced recently.
Ganin et al. [14] proposed N4-Fields that combines CNNs with the nearest neighbor search.
Shen et al. [15] partitioned contour data into subclasses and fitted each subclass by learning the model parameters.
Recently, Xie et al. [16] developed an efficient and accurate edge detector, HED, which performs image-to-image training and prediction.
This holistically-nested architecture connects its side output layers, each composed of one conv layer with kernel size 1, one deconv layer, and one softmax layer, to the last conv layer of each stage in VGG16 [11].
Moreover, Liu et al. [26] used relaxed labels generated by bottom-up edges to guide the training process of HED.
Wang et al. [27] leveraged a top-down backward refinement pathway to effectively learn crisp boundaries.
Xu et al. [28] introduced a hierarchical deep model to robustly fuse the edge representations learned at different scales.
Yu et al. [29] extended the success in edge detection to semantic edge detection which simultaneously detected and recognized the semantic categories of edge pixels.

Although the aforementioned CNN-based models have advanced the state of the art to some extent, in our view they all fall short because they are not able to fully exploit the rich feature hierarchies from CNNs.
These methods usually adopt CNN features only from the last layer of each conv stage.
To address this, we propose a fully convolutional network to combine features from all conv layers efficiently.

3 RICHER CONVOLUTIONAL FEATURES (RCF)

3.1 Network Architecture

We take inspiration from existing work [12], [16] and start from the VGG16 network [11].
The VGG16 network is composed of 13 conv layers and 3 fully connected layers.
Its conv layers are divided into five stages, with a pooling layer connected after each stage.
The useful information captured by each conv layer becomes coarser with its receptive field size increasing.
Detailed receptive field sizes of different layers can be found in [16].
The use of this rich hierarchical information is hypothesized to help edge detection.
The starting point of our network design lies here.

The network we introduce is shown in Fig. 2.
Compared with VGG16, our modifications can be summarized as follows:

1. We cut all the fully connected layers and the pool5 layer.
On the one hand, removing the fully connected layers gives a fully convolutional network for image-to-image prediction. On the other hand, adding the pool5 layer would increase the stride by a factor of two, which usually degrades edge localization.

2. Each conv layer in VGG16 is connected to a conv layer with kernel size 1 × 1 and channel depth 21.
The resulting feature maps in each stage are accumulated using an eltwise layer to attain hybrid features.

3. A 1 × 1 − 1 conv layer (kernel size 1 × 1, channel depth 1) follows each eltwise layer.
Then, a deconv layer is used to up-sample this feature map.

4. A cross-entropy loss/sigmoid layer is connected to the up-sampling layer in each stage.

5. All the up-sampling layers are concatenated.
Then a 1 × 1 conv layer is used to fuse the feature maps from each stage.
Finally, a cross-entropy loss/sigmoid layer follows to produce the fusion loss/output.
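The five modifications above can be sketched in a few lines of NumPy to make the data flow concrete. This is only an illustrative sketch, not the authors' implementation: the weights are random placeholders, nearest-neighbour up-sampling stands in for the deconv layer, and the stage strides (1, 2, 4, 8, 16) assume a plain VGG16 with pool5 removed. The names `conv1x1`, `upsample`, and `rcf_head` are hypothetical helpers introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_channels):
    """1x1 convolution on a (C, H, W) map: a per-pixel linear map over channels.
    The weights are random placeholders, not trained values."""
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.01
    return np.einsum("oc,chw->ohw", w, x)

def upsample(x, factor):
    """Nearest-neighbour up-sampling, used here as a stand-in for the deconv layer."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcf_head(stage_features, strides):
    """stage_features: one list per stage, each holding the (C, H, W) outputs
    of that stage's conv layers. strides: down-sampling factor of each stage."""
    side_logits = []
    for feats, stride in zip(stage_features, strides):
        # Modification 2: 1x1-21 conv on every conv layer, then element-wise sum.
        hybrid = sum(conv1x1(f, 21) for f in feats)
        # Modification 3: 1x1-1 conv, then up-sample back to the input resolution.
        side_logits.append(upsample(conv1x1(hybrid, 1), stride))
    # Modification 5: concatenate all side outputs and fuse them with a 1x1 conv.
    fused_logit = conv1x1(np.concatenate(side_logits, axis=0), 1)
    # Modifications 4 and 5: a sigmoid yields per-stage and fused edge probabilities.
    return [sigmoid(s) for s in side_logits], sigmoid(fused_logit)

# Fake VGG16-like features for a 32x32 input (random values, for shape checking only).
H = W = 32
channels = [64, 128, 256, 512, 512]   # channel depth of each VGG16 stage
convs_per_stage = [2, 2, 3, 3, 3]     # 13 conv layers in total
strides = [1, 2, 4, 8, 16]            # assumed strides with pool5 removed
stage_features = [
    [rng.standard_normal((c, H // s, W // s)) for _ in range(n)]
    for c, n, s in zip(channels, convs_per_stage, strides)
]
side_probs, fused_prob = rcf_head(stage_features, strides)
print([p.shape for p in side_probs], fused_prob.shape)
```

Every side output and the fused output end up at the input resolution, which is what allows a loss to be attached to each stage as well as to the fusion layer.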

In RCF, features from all conv layers are well-encapsulated into a final representation in a holistic manner which is amenable to training by back-propagation.
As the receptive field sizes of the conv layers in VGG16 differ from each other, RCF provides a better mechanism than existing ones for learning multiscale information from all levels of convolutional features, which we believe are all pertinent for edge detection.
In RCF, high-level features are coarser and respond strongly at the boundaries of larger objects or object parts, as illustrated in Fig. 1,
while features from the lower layers of CNNs are still beneficial in providing complementary fine details.

3.2 Annotator-robust Loss Function

Edge datasets in this community are usually labeled by several annotators using their knowledge about the presence of objects or object parts.
Though humans vary in cognition, these human-labeled edges for the same image share high consistency [8].
For each image, we average all the ground truth to generate an edge probability map, which ranges from 0 to 1.
Here, 0 means no annotator labeled this pixel, and 1 means all annotators labeled this pixel.
We consider the pixels with edge probabilities higher than η as positive samples and the pixels with edge probabilities equal to 0 as negative samples.
Otherwise, if a pixel is marked by fewer than a fraction η of the annotators, it may be semantically controversial as an edge point.
Thus, regarding those pixels as either positive or negative samples may confuse the networks.
Hence we ignore them, whereas HED treats them as negative samples and uses a fixed η of 0.5.
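The labeling rule described above can be sketched as follows. This is a minimal illustration, assuming each annotator provides a binary edge map of the same size; the function name `make_edge_labels` and the value η = 0.4 are hypothetical choices made for the example.

```python
import numpy as np

def make_edge_labels(annotations, eta):
    """annotations: (A, H, W) binary edge maps, one per annotator.
    Returns the edge probability map plus boolean masks for
    positive, negative, and ignored pixels."""
    prob = np.asarray(annotations, dtype=float).mean(axis=0)
    positive = prob > eta                # enough annotators agree: edge
    negative = prob == 0                 # no annotator marked it: non-edge
    ignored = ~(positive | negative)     # controversial: excluded from the loss
    return prob, positive, negative, ignored

# Four annotators labeling a tiny 1x4 image.
annotations = np.array([
    [[1, 1, 0, 0]],
    [[1, 0, 0, 0]],
    [[1, 0, 1, 0]],
    [[1, 1, 0, 0]],
])
prob, pos, neg, ign = make_edge_labels(annotations, eta=0.4)
print(prob)  # [[1.   0.5  0.25 0.  ]]
print(pos)   # [[ True  True False False]]
print(neg)   # [[False False False  True]]
print(ign)   # [[False False  True False]]
```

The third pixel, marked by only one of four annotators, is neither positive nor negative and is therefore left out of training.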

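We compute the loss of each pixel with respect to its label as

$$
l(X_i; W) =
\begin{cases}
\alpha \cdot \log\left(1 - P(X_i; W)\right) & \text{if } y_i = 0, \\
0 & \text{if } 0 < y_i \le \eta, \\
\beta \cdot \log P(X_i; W) & \text{otherwise,}
\end{cases}
$$

in which

$$
\alpha = \lambda \cdot \frac{|Y^+|}{|Y^+| + |Y^-|}, \qquad
\beta = \frac{|Y^-|}{|Y^+| + |Y^-|}.
$$

Y⁺ and Y⁻ denote the positive sample set and the negative sample set, respectively.
The hyper-parameter λ is used to balance the number of positive and negative samples.
The activation value (CNN feature vector) and ground truth edge probability at pixel i are represented by X_i and y_i, respectively.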
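A minimal NumPy sketch of this per-pixel loss follows. It rests on several assumptions that are not spelled out in these notes: that the loss is the class-balanced cross-entropy of the RCF paper (weight α = λ·|Y⁺|/(|Y⁺|+|Y⁻|) on negatives, β = |Y⁻|/(|Y⁺|+|Y⁻|) on positives, zero on ignored pixels), that P is the sigmoid of a scalar activation per pixel, and that the log terms are negated so the value is a non-negative quantity to minimize. The function name `rcf_pixel_loss` and the values η = 0.4 and λ = 1.1 are illustrative only.

```python
import numpy as np

def rcf_pixel_loss(logits, prob, eta, lam):
    """Per-pixel class-balanced cross-entropy (assumed form, see lead-in).
    logits: raw activations before the sigmoid; prob: edge probability map.
    Assumes at least one positive or negative pixel exists."""
    logits = np.asarray(logits, dtype=float)
    prob = np.asarray(prob, dtype=float)
    positive = prob > eta
    negative = prob == 0
    n_pos, n_neg = positive.sum(), negative.sum()
    alpha = lam * n_pos / (n_pos + n_neg)   # weight on negative pixels
    beta = n_neg / (n_pos + n_neg)          # weight on positive pixels
    # Numerically stable logs: log(sigmoid(x)) = -logaddexp(0, -x),
    # log(1 - sigmoid(x)) = -logaddexp(0, x).
    log_p = -np.logaddexp(0.0, -logits)
    log_1mp = -np.logaddexp(0.0, logits)
    loss = np.zeros_like(logits)            # ignored pixels keep zero loss
    loss[negative] = -alpha * log_1mp[negative]
    loss[positive] = -beta * log_p[positive]
    return loss

logits = np.array([2.0, -1.0, 0.3, -3.0])
prob = np.array([1.0, 0.5, 0.25, 0.0])      # third pixel is controversial
loss = rcf_pixel_loss(logits, prob, eta=0.4, lam=1.1)
print(loss)
```

The controversial pixel contributes nothing regardless of its activation, while the rarer class is up-weighted through α and β.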
