PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
(Robotics: Science and Systems 2018)
Paper: https://arxiv.org/abs/1711.00199
Code and dataset: https://rse-lab.cs.washington.edu/projects/posecnn/
GitHub: https://github.com/yuxng/PoseCNN
Abstract—Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
1. Introduces PoseCNN, a new convolutional neural network for 6D object pose estimation.
2. PoseCNN estimates an object's 3D translation by localizing its center in the image and predicting its distance from the camera.
3. The object's 3D rotation is estimated by regressing to a quaternion representation.
4. A novel loss function is introduced that enables PoseCNN to handle symmetric objects.
5. Datasets: the YCB-Video dataset and the OccludedLINEMOD dataset.
1 Introduction
Recognizing objects and estimating their poses in 3D has a wide range of applications in robotic tasks. For instance, recognizing the 3D location and orientation of objects is important for robot manipulation. It is also useful in human-robot interaction tasks such as learning from demonstration. However, the problem is challenging due to the variety of objects in the real world. They have different 3D shapes, and their appearances in images are affected by lighting conditions, clutter in the scene, and occlusions between objects.
Traditionally, the problem of 6D object pose estimation is tackled by matching feature points between 3D models and images [20, 25, 8]. However, these methods require rich textures on the objects in order to detect feature points for matching. As a result, they are unable to handle texture-less objects. With the emergence of depth cameras, several methods have been proposed for recognizing texture-less objects using RGB-D data [13, 3, 2, 26, 15]. For template-based methods [13, 12], occlusions significantly reduce the recognition performance. Alternatively, methods that learn to regress image pixels to 3D object coordinates in order to establish the 2D-3D correspondences for 6D pose estimation [3, 4] cannot handle symmetric objects.
In this work, we propose a generic framework for 6D object pose estimation where we attempt to overcome the limitations of existing methods. We introduce a novel Convolutional Neural Network (CNN) for end-to-end 6D pose estimation named PoseCNN. A key idea behind PoseCNN is to decouple the pose estimation task into different components, which enables the network to explicitly model the dependencies and independencies between them. Specifically, PoseCNN performs three related tasks as illustrated in Fig. 1. First, it predicts an object label for each pixel in the input image. Second, it estimates the 2D pixel coordinates of the object center by predicting a unit vector from each pixel towards the center. Using the semantic labels, image pixels associated with an object vote on the object center location in the image. In addition, the network also estimates the distance of the object center. Assuming known camera intrinsics, estimation of the 2D object center and its distance enables us to recover its 3D translation T. Finally, the 3D rotation R is estimated by regressing convolutional features extracted inside the bounding box of the object to a quaternion representation of R. As we will show, the 2D center voting followed by rotation regression to estimate R and T can be applied to textured/texture-less objects and is robust to occlusions, since the network is trained to vote on object centers even when they are occluded.
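To make the translation recovery concrete, here is a minimal sketch of the back-projection step, assuming a standard pinhole camera model with focal lengths fx, fy and principal point (px, py); the function name and interface are illustrative and not taken from the released code.

```python
import numpy as np

def backproject_center(cx, cy, Tz, fx, fy, px, py):
    """Recover the 3D translation T = (Tx, Ty, Tz) from the voted 2D object
    center (cx, cy) and the predicted distance Tz, using the pinhole model:
    cx = fx * Tx / Tz + px,  cy = fy * Ty / Tz + py."""
    Tx = (cx - px) * Tz / fx
    Ty = (cy - py) * Tz / fy
    return np.array([Tx, Ty, Tz])
```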
Fig. 1. We propose PoseCNN, a novel convolutional neural network for 6D object pose estimation, where the network is trained to perform three tasks: semantic labeling, 3D translation estimation, and 3D rotation regression.
Handling symmetric objects is another challenge for pose estimation, since different object orientations may generate identical observations. For instance, it is not possible to uniquely estimate the orientation of the red bowl or the wood block shown in Fig. 5. While pose benchmark datasets such as the OccludedLINEMOD dataset [17] consider a special symmetric evaluation for such objects, symmetries are typically ignored during network training. However, this can result in poor training performance, since the network receives inconsistent loss signals, such as a high loss on an object orientation even though the estimate is correct with respect to the symmetry of the object. Inspired by this observation, we introduce ShapeMatch-Loss, a new loss function that focuses on matching the 3D shape of an object. We will show that this loss function produces superior estimates for objects with shape symmetries.
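To build intuition for the idea, below is a minimal NumPy sketch of a ShapeMatch-style loss: it rotates the object's 3D model points by the estimated and ground-truth rotations and, for each estimated point, penalizes only the squared distance to the nearest ground-truth point, so an orientation that maps a symmetric model onto itself incurs no loss. This nearest-neighbor version is only an illustration; the loss used in the paper is defined over the regressed quaternions and trained end to end.

```python
import numpy as np
from scipy.spatial import cKDTree

def shapematch_loss(R_est, R_gt, model_points):
    """Average squared distance from each point of the model under the
    estimated rotation to its nearest neighbor on the model under the
    ground-truth rotation."""
    est = model_points @ R_est.T  # model points under the estimated rotation
    gt = model_points @ R_gt.T    # model points under the ground-truth rotation
    dists, _ = cKDTree(gt).query(est)  # nearest-neighbor distances
    return 0.5 * np.mean(dists ** 2)
```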
We evaluate our method on the OccludedLINEMOD dataset [17], a benchmark dataset for 6D pose estimation. On this challenging dataset, PoseCNN achieves state-of-the-art results for both color-only and RGB-D pose estimation (we use depth images in the Iterative Closest Point (ICP) algorithm for pose refinement). To thoroughly evaluate our method, we additionally collected a large-scale RGB-D video dataset named YCB-Video, which contains 6D poses of 21 objects from the YCB object set [5] in 92 videos with a total of 133,827 frames. Objects in the dataset exhibit different symmetries and are arranged in various poses and spatial configurations, generating severe occlusions between them.
In summary, our work has the following key contributions:
• We propose a novel convolutional neural network for 6D object pose estimation named PoseCNN. Our network achieves end-to-end 6D pose estimation and is very robust to occlusions between objects.
• We introduce ShapeMatch-Loss, a new training loss function for pose estimation of symmetric objects.
• We contribute a large scale RGB-D video dataset for 6D object pose estimation, where we provide 6D pose annotations for 21 YCB objects.
This paper is organized as follows. After discussing related work, we introduce PoseCNN for 6D object pose estimation, followed by experimental results and a conclusion.
2 Related Work
6D object pose estimation methods in the literature can be roughly classified into template-based methods and feature-based methods. In template-based methods, a rigid template is constructed and used to scan different locations in the input image. At each location, a similarity score is computed, and the best match is obtained by comparing these similarity scores [12, 13, 6]. In 6D pose estimation, a template is usually obtained by rendering the corresponding 3D model. Recently, 2D object detection methods have been used as template matching and augmented for 6D pose estimation, especially with deep learning-based object detectors [28, 23, 16, 29]. Template-based methods are useful for detecting texture-less objects. However, they cannot handle occlusions between objects very well, since the template will have a low similarity score if the object is occluded.
In feature-based methods, local features are extracted from either points of interest or every pixel in the image and matched to features on the 3D models to establish the 2D-3D correspondences, from which 6D poses can be recovered [20, 25, 30, 22]. Feature-based methods are able to handle occlusions between objects. However, they require sufficient textures on the objects in order to compute the local features. To deal with texture-less objects, several methods have been proposed to learn feature descriptors using machine learning techniques [32, 10]. A few approaches directly regress each pixel to a 3D object coordinate in order to establish the 2D-3D correspondences [3, 17, 4], but 3D coordinate regression encounters ambiguities in dealing with symmetric objects.
In this work, we combine the advantages of both template-based methods and feature-based methods in a deep learning framework, where the network combines bottom-up pixel-wise labeling with top-down object pose regression. Recently, the 6D object pose estimation problem has received more attention thanks to the competition in the Amazon Picking Challenge (APC). Several datasets and approaches have been introduced for the specific setting in the APC [24, 35]. Our network has the potential to be applied to the APC setting as long as the appropriate training data is provided.
3 PoseCNN
Given an input image, the task of 6D object pose estimation is to estimate the rigid transformation from the object coordinate system O to the camera coordinate system C. We assume that the 3D model of the object is available and that the object coordinate system is defined in the 3D space of the model. The rigid transformation here is an SE(3) transform consisting of a 3D rotation R and a 3D translation T, where R specifies the rotation angles around the X-axis, Y-axis and Z-axis of the object coordinate system O, and T is the coordinate of the origin of O in the camera coordinate system C. In the imaging process, T determines the object location and scale in the image, while R affects the image appearance of the object according to the 3D shape and texture of the object. Since these two parameters have distinct visual properties, we propose a convolutional neural network architecture that internally decouples the estimation of R and T.
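As a concrete reading of this definition, a 3D point x in the object coordinate system O maps to the camera coordinate system C as Rx + T. The sketch below applies this transform from the quaternion representation the network regresses to, assuming the (w, x, y, z) convention; the helper names are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion q = (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def object_to_camera(points_o, q, T):
    """Apply the rigid transform x_c = R x_o + T to an (N, 3) array of points."""
    return points_o @ quat_to_rotmat(q).T + np.asarray(T)
```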
A. Overview of the Network
Fig. 2 illustrates the architecture of our network for 6D object pose estimation. The network contains two stages. The first stage consists of 13 convolutional layers and 4 max-pooling layers, which extract feature maps with different resolutions from the input image. This stage is the backbone of the network, since the extracted features are shared across all the tasks performed by the network. The second stage consists of an embedding step that embeds the high-dimensional feature maps generated by the first stage into low-dimensional, task-specific features. Then, the network performs three different tasks that lead to the 6D pose estimation, i.e., semantic labeling, 3D translation estimation, and 3D rotation regression, as described next.
B. Semantic Labeling
In order to detect objects in images, we resort to semantic labeling, where the network classifies each image pixel into an object class. Compared to recent 6D pose estimation methods that rely on object detection with bounding boxes [23, 16, 29], semantic labeling provides richer information about the objects and handles occlusions better.
The embedding step of the semantic labeling branch, as shown in Fig. 2, takes as inputs two feature maps with channel dimension 512 generated by the feature extraction stage. The resolutions of the two feature maps are 1/8 and 1/16 of the original image size, respectively. The network first reduces the channel dimension of the two feature maps to 64 using two convolutional layers. Then it doubles the resolution of the 1/16 feature map with a deconvolutional layer. After that, the two feature maps are summed, and another deconvolutional layer is used to increase the resolution by 8 times in order to obtain a feature map at the original image size. Finally, a convolutional layer operates on this feature map and generates semantic labeling scores for the pixels. The output of this layer has n channels, where n is the number of semantic classes. In training, a softmax cross-entropy loss is applied to train the semantic labeling branch, while in testing, a softmax function is used to compute the class probabilities of the pixels. The design of the semantic labeling branch is inspired by the fully convolutional network in [19] for semantic labeling. It is also used in our previous work for scene labeling [34].
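For clarity, here is a minimal PyTorch-style sketch of this embedding step; the kernel sizes and deconvolution parameters are assumptions chosen to reproduce the stated channel and resolution changes, not values taken from the paper.

```python
import torch.nn as nn

class SemanticBranch(nn.Module):
    """Sketch of the semantic labeling branch: reduce two 512-channel feature
    maps (at 1/8 and 1/16 resolution) to 64 channels, upsample and sum them,
    upsample by 8x to full resolution, and score each pixel over n classes."""
    def __init__(self, num_classes):
        super().__init__()
        self.reduce8 = nn.Conv2d(512, 64, kernel_size=1)   # 1/8 map: 512 -> 64
        self.reduce16 = nn.Conv2d(512, 64, kernel_size=1)  # 1/16 map: 512 -> 64
        # Deconvolution doubling the 1/16 map to 1/8 resolution.
        self.up2 = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
        # Deconvolution increasing the summed 1/8 map by 8x to full resolution.
        self.up8 = nn.ConvTranspose2d(64, 64, kernel_size=16, stride=8, padding=4)
        self.score = nn.Conv2d(64, num_classes, kernel_size=1)  # per-pixel logits

    def forward(self, feat8, feat16):
        x = self.reduce8(feat8) + self.up2(self.reduce16(feat16))  # sum at 1/8
        return self.score(self.up8(x))  # logits; train with softmax cross-entropy
```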