[3D Segmentation Benchmark] ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes

Please credit the author and source when reposting: http://blog.csdn.net/john_bh/

Paper: ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes
Authors / team: Stanford University
Venue: CVPR 2017
Code: GitHub link

Abstract

A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available – current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.

1. Introduction

Since the introduction of commodity RGB-D sensors, such as the Microsoft Kinect, the field of 3D geometry capture has gained significant attention and opened up a wide range of new applications. Although there has been significant effort on 3D reconstruction algorithms, general 3D scene understanding with RGB-D data has only very recently started to become popular. Research on semantic understanding is also heavily facilitated by the rapid progress of modern machine learning methods, such as neural models. One key to successfully applying these approaches is the availability of large, labeled datasets. While much effort has been made on 2D datasets [17, 44, 47], where images can be downloaded from the web and directly annotated, the situation for 3D data is more challenging. Thus, many of the current RGB-D datasets [74, 92, 77, 32] are orders of magnitude smaller than their 2D counterparts. Typically, 3D deep learning methods use synthetic data to mitigate this lack of real-world data [91, 6].

One of the reasons that current 3D datasets are small is because their capture requires much more effort, and efficiently providing (dense) annotations in 3D is non-trivial. Thus, existing work on 3D datasets often falls back to polygon or bounding box annotations on 2.5D RGB-D images [74, 92, 77], rather than directly annotating in 3D. In the latter case, labels are added manually by expert users (typically by the paper authors) [32, 71], which limits their overall size and scalability.

In this paper, we introduce ScanNet, a dataset of richly-annotated RGB-D scans of real-world environments containing 2.5M RGB-D images in 1513 scans acquired in 707 distinct spaces. The sheer magnitude of this dataset is larger than any other [58, 81, 92, 75, 3, 71, 32]. However, what makes it particularly valuable for research in scene understanding is its annotation with estimated calibration parameters, camera poses, 3D surface reconstructions, textured meshes, dense object-level semantic segmentations, and aligned CAD models (see Fig. 2). The semantic segmentations are more than an order of magnitude larger than any previous RGB-D dataset.
In the collection of this dataset, we have considered two main research questions: 1) how can we design a framework that allows many people to collect and annotate large amounts of RGB-D data, and 2) can we use the rich annotations and data quantity provided in ScanNet to learn better 3D models for scene understanding?

To investigate the first question, we built a capture pipeline to help novices acquire semantically-labeled 3D models of scenes. A person uses an app on an iPad mounted with a depth camera to acquire RGB-D video, and then we process the data off-line and return a complete semantically-labeled 3D reconstruction of the scene. The challenges in developing such a framework are numerous, including how to perform 3D surface reconstruction robustly in a scalable pipeline and how to crowdsource semantic labeling. The paper discusses our study of these issues and documents our experience with scaling up RGB-D scan collection (20 people) and annotation (500 crowd workers).

To investigate the second question, we trained 3D deep networks with the data provided by ScanNet and tested their performance on several scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. For the semantic voxel labeling task, we introduce a new volumetric CNN architecture.

Overall, the contributions of this paper are:

  • A large 3D dataset containing 1513 RGB-D scans of over 707 unique indoor environments with estimated camera parameters, surface reconstructions, textured meshes, and semantic segmentations. We also provide CAD model placements for a subset of the scans.
  • A design for efficient 3D data capture and annotation suitable for novice users.
  • New RGB-D benchmarks and improved results for state-of-the-art machine learning methods on 3D object classification, semantic voxel labeling, and CAD model retrieval.
  • A complete open source acquisition and annotation framework for dense RGB-D reconstructions.

2. Previous Work

A large number of RGB-D datasets have been captured and made publicly available for training and benchmarking [56, 34, 50, 65, 79, 83, 74, 4, 58, 81, 15, 55, 1, 68, 30, 51, 21, 48, 43, 92, 80, 61, 72, 93, 36, 16, 35, 57, 40, 29, 70, 52, 45, 95, 75, 9, 33, 85, 71, 32, 3, 10, 78, 2]. These datasets have been used to train models for many 3D scene understanding tasks, including semantic segmentation [67, 58, 26, 86], 3D object detection [73, 46, 27, 76, 77], 3D object classification [91, 53, 66], and others [94, 22, 23].
Most RGB-D datasets contain scans of individual objects. For example, the Redwood dataset [10] contains over 10,000 scans of objects annotated with class labels, 1,781 of which are reconstructed with KinectFusion [59]. Since the objects are scanned in isolation without scene context, the dataset’s focus is mainly on evaluating surface reconstruction quality rather than semantic understanding of complete scenes.

One of the earliest and most popular datasets for RGB-D scene understanding is NYU v2 [74]. It is composed of 464 short RGB-D sequences, from which 1449 frames have been annotated with 2D polygons denoting semantic segmentations, as in LabelMe [69]. SUN RGB-D [75] follows up on this work by collecting 10,335 RGB-D frames annotated with polygons in 2D and bounding boxes in 3D. These datasets have scene diversity comparable to ours, but include only a limited range of viewpoints, and do not provide complete 3D surface reconstructions, dense 3D semantic segmentations, or a large set of CAD model alignments.

One of the first RGB-D datasets focused on long RGB-D sequences in indoor environments is SUN3D. It contains a set of 415 Kinect v1 sequences of 254 unique spaces. Although some objects were annotated manually with 2D polygons, and 8 scans have estimated camera poses based on user input, the bulk of the dataset does not include camera poses, 3D reconstructions, or semantic annotations.

Recently, Armeni et al. [3, 2] introduced an indoor dataset containing 3D meshes for 265 rooms captured with a custom Matterport camera and manually labeled with semantic annotations. The dataset is high-quality, but the capture pipeline is based on expensive and less portable hardware. Furthermore, only a fused point cloud is provided as output. Due to the lack of raw color and depth data, its applicability to research on reconstruction and scene understanding from raw RGB-D input is limited.

The datasets most similar to ours are SceneNN [32] and PiGraphs [71], which are composed of 100 and 26 densely reconstructed and labeled scenes respectively. The annotations are done directly in 3D [60, 71]. However, both scanning and labeling are performed only by expert users (i.e. the authors), limiting the scalability of the system and the size of the dataset. In contrast, we design our RGB-D acquisition framework specifically for ease-of-use by untrained users and for scalable processing through crowdsourcing. This allows us to acquire a significantly larger dataset with more annotations (currently, 1513 sequences are reconstructed and labeled).

3. Dataset Acquisition Framework

In this section, we focus on the design of the framework used to acquire the ScanNet dataset (Fig. 2). We discuss design trade-offs in building the framework and report which methods we found to work best for large-scale RGB-D data collection and processing.

Our main goal driving the design of our framework was to allow untrained users to capture semantically labeled surfaces of indoor scenes with commodity hardware. Thus the RGB-D scanning system must be trivial to use, the data processing robust and automatic, the semantic annotations crowdsourced, and the flow of data through the system handled by a tracking server.

3.1. RGB-D Scanning
Hardware. There is a spectrum of choices for RGB-D sensor hardware. Our requirement for deployment to large groups of inexperienced users necessitates a portable and low-cost RGB-D sensor setup. We use the Structure sensor [63], a commodity RGB-D sensor with a design similar to the Microsoft Kinect v1. We attach this sensor to a handheld device such as an iPhone or iPad (see Fig. 2 left); results in this paper were collected using iPad Air2 devices. The iPad RGB camera data is temporally synchronized with the depth sensor via hardware, providing synchronized depth and color capture at 30 Hz. Depth frames are captured at a resolution of 640×480 and color at 1296×968 pixels. We enable auto-white balance and auto-exposure by default.

Calibration. Our use of commodity RGB-D sensors necessitates unwarping of depth data and alignment of depth and color data. Prior work has focused mostly on controlled lab conditions with more accurate equipment to inform calibration for commodity sensors (e.g., Wang et al. [87]). However, this is not practical for novice users. Thus the user only needs to print out a checkerboard pattern, place it on a large, flat surface, and capture an RGB-D sequence viewing the surface from close to far away. This sequence, as well as a set of infrared and color frame pairs viewing the checkerboard, is uploaded by the user as input to the calibration. Our system then runs a calibration procedure based on [84, 14] to obtain intrinsic parameters for both depth and color sensors, and an extrinsic transformation of depth to color. We find that this calibration procedure is easy for users and results in improved data and consequently enhanced reconstruction quality.
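
To make the calibration step more concrete, the sketch below shows a minimal checkerboard-based calibration with OpenCV. This is only an approximation of the procedure described above (which follows [84, 14]); the board dimensions, square size, and file-name patterns are placeholder assumptions.

```python
# Minimal sketch of checkerboard-based intrinsic/extrinsic calibration with OpenCV.
# This approximates the step described above; it is NOT the exact pipeline of [84, 14].
# Paths, board size, and square size are placeholder assumptions.
import glob
import cv2
import numpy as np

BOARD = (8, 6)          # inner corners of the printed checkerboard (assumption)
SQUARE = 0.03           # square edge length in meters (assumption)

# 3D checkerboard points in the board's coordinate frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

def calibrate(image_paths):
    """Estimate intrinsics from a set of checkerboard views."""
    obj_pts, img_pts, size = [], [], None
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        ok, corners = cv2.findChessboardCorners(gray, BOARD)
        if not ok:
            continue
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)
        size = gray.shape[::-1]
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
    return K, dist, obj_pts, img_pts, size

# Intrinsics for the color camera and the (infrared) depth camera separately
K_color, d_color, obj_c, img_c, size_c = calibrate(glob.glob("color_*.png"))
K_ir, d_ir, obj_i, img_i, size_i = calibrate(glob.glob("ir_*.png"))

# Extrinsic transform between depth (IR) and color; this assumes the same
# checkerboard views were detected in both sensors, so the pairs correspond.
_, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
    obj_i, img_i, img_c, K_ir, d_ir, K_color, d_color, size_c,
    flags=cv2.CALIB_FIX_INTRINSIC)
print("depth-to-color rotation:\n", R, "\ntranslation:\n", T)
```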

User Interface. To make the capture process simple for untrained users, we designed an iOS app with a simple live RGB-D video capture UI (see Fig. 2 left). The user provides a name and scene type for the current scan and proceeds to record a sequence. During scanning, a log-scale RGB feature detector point metric is shown as a “featurefulness” bar to provide a rough measure of tracking robustness and reconstruction quality in different regions being scanned. This feature was critical for providing intuition to users who are not familiar with the constraints and limitations of 3D reconstruction algorithms.

Storage. We store scans as compressed RGB-D data on the device flash memory so that a stable internet connection is not required during scanning. The user can upload scans to the processing server when convenient by pressing an “upload” button. Our sensor units used 128GB iPad Air2 devices, allowing for several hours of recorded RGB-D video. In practice, the bottleneck was battery life rather than storage space. Depth is recorded as 16-bit unsigned short values and stored using standard zLib compression. RGB data is encoded with the H.264 codec with a high bitrate of 15 Mbps to prevent encoding artifacts. In addition to the RGB-D frames, we also record Inertial Measurement Unit (IMU) data, including acceleration and angular velocities, from the Apple SDK. Timestamps are recorded for IMU, color, and depth images.
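
As a small illustration of the depth storage scheme described above (16-bit unsigned depth values compressed with zLib), here is a hedged Python sketch; the frame layout is an assumption and this is not ScanNet's actual on-disk sensor-stream format.

```python
# Sketch of the per-frame depth storage described above: 16-bit unsigned depth
# values compressed with zlib. This is an illustration only.
import zlib
import numpy as np

def pack_depth(depth_mm: np.ndarray) -> bytes:
    """Compress a 640x480 depth frame of 16-bit millimeter values."""
    assert depth_mm.dtype == np.uint16
    return zlib.compress(depth_mm.tobytes(), level=6)

def unpack_depth(blob: bytes, shape=(480, 640)) -> np.ndarray:
    """Decompress back into a uint16 depth image."""
    raw = zlib.decompress(blob)
    return np.frombuffer(raw, dtype=np.uint16).reshape(shape)

frame = (np.random.rand(480, 640) * 4000).astype(np.uint16)  # fake depth in mm
blob = pack_depth(frame)
assert np.array_equal(unpack_depth(blob), frame)
print(f"raw: {frame.nbytes} bytes, compressed: {len(blob)} bytes")
```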

3.2. Surface Reconstruction
Once data has been uploaded from the iPad to our server, the first processing step is to estimate a densely-reconstructed 3D surface mesh and 6-DoF camera poses for all RGB-D frames. To conform with the goal for an automated and scalable framework, we choose methods that favor robustness and processing speed such that uploaded recordings can be processed at near real-time rates with little supervision.

Dense Reconstruction. We use volumetric fusion [11] to perform the dense reconstruction, since this approach is widely used in the context of commodity RGB-D data. There is a large variety of algorithms targeting this scenario [59, 88, 7, 62, 37, 89, 42, 9, 90, 38, 12]. We chose the BundleFusion system [12] as it was designed and evaluated for similar sensor setups as ours, and provides real-time speed while being reasonably robust given handheld RGB-D video data.

For each input scan, we first run BundleFusion [12] at a voxel resolution of 1 cm³. BundleFusion produces accurate pose alignments, which we then use to perform volumetric integration through VoxelHashing [62] and extract a high resolution surface mesh using the Marching Cubes algorithm on the implicit TSDF (4 mm³ voxels). The mesh is then automatically cleaned up with a set of filtering steps to merge close vertices, delete duplicate and isolated mesh parts, and finally to downsample the mesh to high, medium, and low resolution versions (each level reducing the number of faces by a factor of two).
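
For readers who want to experiment with a similar pipeline, the sketch below reproduces the basic volumetric fusion, Marching Cubes, and mesh simplification flow using Open3D rather than the BundleFusion/VoxelHashing implementation actually used by ScanNet. Camera poses, intrinsics, and file names are placeholders.

```python
# Sketch of dense volumetric fusion + mesh extraction. ScanNet itself uses
# BundleFusion + VoxelHashing + Marching Cubes; here Open3D's ScalableTSDFVolume
# stands in to illustrate the same idea.
import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 577.0, 577.0, 320.0, 240.0)  # placeholder intrinsics
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,  # ~1 cm voxels, as in the paragraph above
    sdf_trunc=0.04,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# Placeholder: 4x4 camera-to-world poses, normally produced by the pose estimation step
camera_poses = [np.eye(4) for _ in range(2)]

for i, pose in enumerate(camera_poses):
    color = o3d.io.read_image(f"frame-{i:06d}.color.jpg")   # placeholder file names
    depth = o3d.io.read_image(f"frame-{i:06d}.depth.png")
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=4.0,
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))  # integrate() expects world-to-camera

mesh = volume.extract_triangle_mesh()                       # marching cubes on the TSDF
mesh.remove_duplicated_vertices()
mesh.remove_degenerate_triangles()
# Downsample to medium and low resolution, roughly halving the face count at each level
medium = mesh.simplify_quadric_decimation(len(mesh.triangles) // 2)
low = medium.simplify_quadric_decimation(len(medium.triangles) // 2)
```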

Orientation. After the surface mesh is extracted, we automatically align it and all camera poses to a common coordinate frame with the z-axis as the up vector and the xy plane aligned with the floor plane. To perform this alignment, we first extract all planar regions of sufficient size, merge regions defined by the same plane, and sort them by normal (we use a normal threshold of 25° and a planar offset threshold of 5 cm). We then determine a prior for the up vector by projecting the IMU gravity vectors of all frames into the coordinates of the first frame. This allows us to select the floor plane based on the scan bounding box and the normal most similar to the IMU up vector direction. Finally, we use a PCA on the mesh vertices to determine the rotation around the z-axis and translate the scan such that its bounds are within the positive octant of the coordinate system.
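
A minimal sketch of this alignment step is shown below, assuming the planar regions have already been extracted: it picks the floor plane with the IMU gravity prior, rotates its normal onto +z, fixes the in-plane rotation with PCA, and translates the scan into the positive octant. All inputs are placeholders.

```python
# Sketch of the scan orientation step above. Plane extraction is assumed done.
import numpy as np

def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues form).
    Anti-parallel vectors (a == -b) are not handled in this sketch."""
    v, c = np.cross(a, b), np.dot(a, b)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def orient_scan(vertices, plane_normals, up_prior):
    """vertices: (N,3) mesh vertices; plane_normals: candidate floor-plane normals;
    up_prior: gravity-based up vector projected into scan coordinates."""
    up_prior = up_prior / np.linalg.norm(up_prior)
    # Floor plane = candidate whose normal best matches the IMU up direction
    floor_n = max(plane_normals, key=lambda n: np.dot(n / np.linalg.norm(n), up_prior))
    R_up = rotation_between(floor_n / np.linalg.norm(floor_n), np.array([0.0, 0.0, 1.0]))
    v = vertices @ R_up.T
    # PCA on the xy footprint to fix the rotation around z
    xy = v[:, :2] - v[:, :2].mean(axis=0)
    _, _, Vt = np.linalg.svd(xy, full_matrices=False)
    ang = np.arctan2(Vt[0, 1], Vt[0, 0])
    Rz = np.array([[np.cos(-ang), -np.sin(-ang), 0.0],
                   [np.sin(-ang),  np.cos(-ang), 0.0],
                   [0.0, 0.0, 1.0]])
    v = v @ Rz.T
    # Translate so the bounding box lies in the positive octant
    return v - v.min(axis=0)

verts = np.random.rand(1000, 3)                                   # placeholder vertices
normals = [np.array([0.05, 0.02, 0.99]), np.array([1.0, 0.0, 0.0])]
aligned = orient_scan(verts, normals, up_prior=np.array([0.0, 0.1, 1.0]))
```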

Validation. This reconstruction process is automatically triggered when a scan is uploaded to the processing server and runs unsupervised. In order to establish a clean snapshot to construct the ScanNet dataset reported in this paper, we automatically discard scan sequences that are short, have high residual reconstruction error, or have a low percentage of aligned frames. We then manually check for and discard reconstructions with noticeable misalignments.

3.3. Semantic Annotation
After a reconstruction is produced by the processing server, annotation HITs (Human Intelligence Tasks) are issued on the Amazon Mechanical Turk crowdsourcing market. The two HITs that we crowdsource are: i) instance-level object category labeling of all surfaces in the reconstruction, and ii) 3D CAD model alignment to the reconstruction. These annotations are crowdsourced using web-based interfaces to again maintain the overall scalability of the framework.

Instance-level Semantic Labeling. Our first annotation step is to obtain a set of object instance-level labels directly on each reconstructed 3D surface mesh. This is in contrast to much prior work that uses 2D polygon annotations on RGB or RGB-D images, or 3D bounding box annotations.
We developed a WebGL interface that takes as input the low-resolution surface mesh of a given reconstruction and a conservative over-segmentation of the mesh using a normal-based graph cut method [19, 39]. The crowd worker then selects segments to annotate with instance-level object category labels (see Fig. 3). Each worker is required to annotate at least 25% of the surfaces in a reconstruction, and encouraged to annotate more than 50% before submission. Each scan is annotated by multiple workers (scans in ScanNet are annotated by 2.3 workers on average).

A key challenge in designing this interface is to enable efficient annotation by workers who have no prior experience with the task, or 3D interfaces in general. Our interface uses a simple painting metaphor where clicking and dragging over surfaces paints segments with a given label and corresponding color. This functions similarly to 2D painting and allows for erasing and modifying existing regions.

Another design requirement is to allow for freeform text labels, to reduce the inherent bias and scalability issues of pre-selected label lists. At the same time, it is desirable to guide users for consistency and coverage of basic object types. To achieve this, the interface provides autocomplete functionality over all labels previously provided by other workers that pass a frequency threshold (> 5 annotations). Workers are always allowed to add arbitrary text labels to ensure coverage and allow expansion of the label set.
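
The autocomplete behavior can be summarized with a short sketch: suggestions are drawn only from labels other workers have used more than five times, while free-form entries remain allowed. Function names and the threshold handling below are illustrative assumptions.

```python
# Sketch of the label autocomplete described above: suggestions come from labels
# that other workers have already used more than 5 times, but free-form entries
# are always accepted.
from collections import Counter

FREQ_THRESHOLD = 5

def autocomplete_vocab(previous_labels):
    """Labels (already normalized to lowercase) that pass the frequency threshold."""
    counts = Counter(previous_labels)
    return sorted(l for l, c in counts.items() if c > FREQ_THRESHOLD)

def suggest(prefix, vocab):
    """Prefix-match suggestions for the annotation text box."""
    prefix = prefix.lower().strip()
    return [l for l in vocab if l.startswith(prefix)]

vocab = autocomplete_vocab(["chair"] * 12 + ["table"] * 8 + ["weird thing"] * 2)
print(suggest("ch", vocab))   # ['chair']; 'weird thing' is never suggested but still allowed
```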

Several additional design details are important to ensure usability by novice workers. First, a simple distance check for connectedness is used to disallow labeling of disconnected surfaces with the same label. Earlier experiments without this constraint resulted in two undesirable behaviors: cheating by painting many surfaces with a few labels, and labeling of multiple object instances with the same label. Second, the 3D nature of the data is challenging for novice users. Therefore, we first show a full turntable rotation of each reconstruction and instruct workers to change the view using a rotating turntable metaphor. Without the turntable rotation animation, many workers only annotated from the initial view and never used camera controls despite the provided instructions.

CAD Model Retrieval and Alignment. In the second annotation task, a crowd worker was given a reconstruction already annotated with object instances and asked to place appropriate 3D CAD models to represent major objects in the scene. The challenge of this task lies in the selection of closely matching 3D models from a large database, and in precisely aligning each model to the 3D position of the corresponding object in the reconstruction.

We implemented an assisted object retrieval interface where clicking on a previously labeled object in a reconstruction immediately searched for CAD models with the same category label in the ShapeNetCore [6] dataset, and placed one example model such that it overlaps with the oriented bounding box of the clicked object (see Fig. 4). The worker then used keyboard and mouse-based controls to adjust the alignment of the model, and was allowed to submit the task once at least three CAD models were placed.
Using this interface, we collected sets of CAD models aligned to each ScanNet reconstruction. Preliminary results indicate that despite the challenging nature of this task, workers select semantically appropriate CAD models to match objects in the reconstructions. The main limitation of this interface is due to the mismatch between the corpus of available CAD models and the objects observed in the ScanNet scans. Despite the diversity of the ShapeNet CAD model dataset (55K objects), it is still hard to find exact instance-level matches for chairs, desks and more rare object categories. A promising way to alleviate this limitation is to algorithmically suggest candidate retrieved and aligned CAD models such that workers can perform an easier verification and adjustment task.

4. ScanNet Dataset

In this section, we summarize the data we collected using our framework to establish the ScanNet dataset. This dataset is a snapshot of available data from roughly one month of data acquisition by 20 users at locations in several countries. It has annotations by more than 500 crowd workers on the Mechanical Turk platform. Since the presented framework runs in an unsupervised fashion and people are continuously collecting data, this dataset continues to grow organically. Here, we report some statistics for an initial snapshot of 1513 scans, which are summarized in Table 2.
Fig. 5 plots the distribution of scanned scenes over different types of real-world spaces. ScanNet contains a variety of spaces such as offices, apartments, and bathrooms. The dataset contains a diverse set of spaces ranging from small (e.g., bathrooms, closets, utility rooms) to large (e.g., apartments, classrooms, and libraries). Each scan has been annotated with instance-level semantic category labels through our crowdsourcing task. In total, we deployed 3,391 annotation tasks to annotate all 1513 scans.

The text labels used by crowd workers to annotate object instances are all mapped to the object category sets of NYU v2 [58], ModelNet [91], ShapeNet [6], and WordNet [18] synsets. This mapping is made more robust by a preprocess that collapses the initial text labels through synonym and misspelling detection.
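
A hedged sketch of this label-collapsing step is given below, using a toy synonym table and difflib-based misspelling detection; the actual ScanNet mapping tables and similarity criteria are not given here, so everything shown is an assumption for illustration.

```python
# Sketch of collapsing raw crowd labels via synonym/misspelling detection before
# mapping them to the NYU v2 / ModelNet / ShapeNet / WordNet category sets.
# The synonym table and category map are tiny illustrative placeholders.
import difflib

SYNONYMS = {"couch": "sofa", "trashcan": "trash can"}                         # placeholder
NYU40_MAP = {"chair": "chair", "sofa": "sofa", "trash can": "otherfurniture"}  # placeholder

def canonicalize(raw_label, known_labels):
    label = raw_label.lower().strip()
    label = SYNONYMS.get(label, label)
    # Misspelling detection: snap to the closest known label if it is very similar
    close = difflib.get_close_matches(label, known_labels, n=1, cutoff=0.8)
    return close[0] if close else label

known = list(NYU40_MAP.keys())
for raw in ["Chair", "chiar", "couch", "trashcan"]:
    canon = canonicalize(raw, known)
    print(raw, "->", canon, "->", NYU40_MAP.get(canon, "unmapped"))
```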

In addition to reconstructing and annotating the 1513 ScanNet scans, we have processed all the NYU v2 RGB-D sequences with our framework. The result is a set of dense reconstructions of the NYU v2 spaces with instance-level object annotations in 3D that are complementary in nature to the existing image-based annotations.

We also deployed the CAD model alignment crowdsourcing task to collect a total of 107 virtual scene interpretations consisting of aligned ShapeNet models placed on a subset of 52 ScanNet scans by 106 workers. There were a total of 681 CAD model instances (of 296 unique models) retrieved and placed on the reconstructions, with an average of 6.4 CAD model instances per annotated scan.

For more detailed statistics on this first ScanNet dataset snapshot, please see the supplemental material.

5. Tasks and Benchmarks

In this section, we describe the three tasks we developed as benchmarks for demonstrating the value of ScanNet data.
Train/Test split statistics. Table 3 shows the test and training splits of ScanNet in the context of the object classification and dense voxel prediction benchmarks. Note that our data is significantly larger than any existing comparable dataset. We use these tasks to demonstrate that ScanNet enables the use of deep learning methods for 3D scene understanding tasks with supervised training, and compare performance to that using data from other existing datasets.

5.1. 3D Object Classification
With the availability of large-scale synthetic 3D datasets such as [91, 6] and recent advances in 3D deep learning, research has developed approaches to classify objects using only geometric data with volumetric deep nets [91, 82, 52, 13, 66]. All of these methods train on purely synthetic data and focus on isolated objects. Although they show limited evaluation on real-world data, a larger evaluation on realistic scanning data is largely missing. When training data is synthetic and testing is performed on real data, there is also a significant discrepancy in test performance, as data characteristics, such as noise and occlusion patterns, are inherently different.

With ScanNet, we close this gap as we have captured a sufficiently large amount of 3D data to use real-world RGB-D input for both training and test sets. For this task, we use the bounding boxes of annotated objects in ScanNet, and isolate the contained geometry. As a result, we obtain local volumes around each object instance for which we know the annotated category. The goal of the task is to classify the object represented by a set of scanned points within a given bounding box. For this benchmark, we use 17 categories, with 9,677 train instances and 2,606 test instances.

Network and training. For object classification, we follow the network architecture of the 3D Network-in-Network of [66], without the multi-orientation pooling step. In order to classify partial data, we add a second channel to the 30³ occupancy grid input, indicating known and unknown regions (with 1 and 0, respectively) according to the camera scanning trajectory. As in Qi et al. [66], we use an SGD solver with learning rate 0.01 and momentum 0.9, decaying the learning rate by half every 20 epochs, and training the model for 200 epochs. We augment training samples with 12 instances of different rotations (including both elevation and tilt), resulting in a total training set of 111,660 samples.
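
The sketch below mirrors this training setup in PyTorch (two-channel 30³ occupancy input, SGD with learning rate 0.01 and momentum 0.9, halved every 20 epochs over 200 epochs). The network itself is a small placeholder rather than the 3D Network-in-Network of [66].

```python
# Sketch of the classification training setup described above. The actual
# architecture of [66] is stubbed out with a small placeholder 3D CNN.
import torch
import torch.nn as nn

NUM_CLASSES = 17

class Simple3DNet(nn.Module):          # placeholder, NOT the 3D NIN of [66]
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(2, 32, kernel_size=5, stride=2), nn.ReLU(),   # 30 -> 13
            nn.Conv3d(32, 64, kernel_size=3, stride=2), nn.ReLU(),  # 13 -> 6
            nn.AdaptiveAvgPool3d(1))
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):               # x: (B, 2, 30, 30, 30)
        return self.classifier(self.features(x).flatten(1))

model = Simple3DNet()
# SGD with lr 0.01 and momentum 0.9; learning rate halved every 20 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = nn.CrossEntropyLoss()

# One illustrative optimization step; real training runs 200 epochs over the
# 111,660 augmented samples and steps the scheduler once per epoch.
x = torch.zeros(8, 2, 30, 30, 30)   # channel 0: occupancy, channel 1: known/unknown space
y = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```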

Benchmark performance. As a baseline evaluation, we run the 3D CNN approach of Qi et al. [66]. Table 4 shows the performance of 3D shape classification with different train and test sets. The first two columns show results on synthetic test data from ShapeNet [6], including both complete and partial data. Naturally, training with the corresponding synthetic counterparts of ShapeNet provides the best performance, as data characteristics are shared. However, the more interesting case is real-world test data (right-most two columns); here, we show results on test sets of SceneNN [32] and ScanNet. First, we see that training on synthetic data allows only for limited knowledge transfer (first two rows). Second, although the relatively small SceneNN dataset is able to learn within its own dataset to a reasonable degree, it does not generalize to the larger variety of environments found in ScanNet. On the other hand, training on ScanNet translates well to testing on SceneNN; as a result, the test results on SceneNN are significantly improved by using the training data from ScanNet. Interestingly enough, these results can be slightly improved when mixing training data of ScanNet with partial scans of ShapeNet (last row).
5.2. Semantic Voxel Labeling
A common task on RGB data is semantic segmentation (i.e. labeling pixels with semantic classes) [49]. With our data, we can extend this task to 3D, where the goal is to predict the semantic object label on a per-voxel basis. This task of predicting a semantic class for each visible 3D voxel has been addressed by some prior work, but using handcrafted features to predict a small number of classes [41,86], or focusing on outdoor environments [8, 5].

Data Generation. We first voxelize a scene and obtain a dense voxel grid with 2 cm³ voxels, where every voxel stores its TSDF value and object class annotation (empty space and unlabeled surface points have their own respective classes). We then extract subvolumes of the scene volume, of dimension 2 × 31 × 31 × 62 and spatial extent 1.5m × 1.5m × 3m, i.e., a voxel size of 4.8 cm³; the two channels represent the occupancy and known/unknown space according to the camera trajectory. These sample volumes are aligned with the xy-ground plane. For ground truth data generation, voxel labels are propagated from the scene voxelization to these sample volumes. Samples are chosen such that at least 2% of their voxels are occupied (i.e., on the surface) and at least 70% of these surface voxels have valid annotations; samples not meeting these criteria are discarded. Across ScanNet, we generate 93,721 subvolume examples for training, augmented by 8 rotations each (i.e., 749,768 training samples), from 1201 training scenes. In addition, we extract 18,750 sample volumes for testing, which are also augmented by 8 rotations each (i.e., 150,000 test samples) from 312 test scenes. We have 20 object class labels plus 1 class for free space.
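
A minimal sketch of the subvolume sampling criteria is shown below: a candidate sample is kept only if at least 2% of its voxels are occupied and at least 70% of those surface voxels are labeled. Grid shapes, the stride, and the "0 = unlabeled" convention are assumptions for illustration.

```python
# Sketch of the subvolume sampling described above for 31 x 31 x 62 samples.
import numpy as np

DX, DY, DZ = 31, 31, 62          # subvolume size in voxels (4.8 cm voxels)

def sample_ok(occupancy, labels, min_occ=0.02, min_labeled=0.70):
    """occupancy, labels: (31, 31, 62) arrays cropped from the scene grid."""
    surface = occupancy > 0
    if surface.mean() < min_occ:
        return False
    labeled = labels[surface] > 0   # 0 = unlabeled surface (assumption)
    return labeled.mean() >= min_labeled

def extract_samples(scene_occ, scene_labels, stride=31):
    """Slide over the xy plane of the scene grid and keep valid subvolumes."""
    X, Y, Z = scene_occ.shape
    samples = []
    for x in range(0, X - DX + 1, stride):
        for y in range(0, Y - DY + 1, stride):
            for z in range(0, Z - DZ + 1, DZ):
                occ = scene_occ[x:x+DX, y:y+DY, z:z+DZ]
                lab = scene_labels[x:x+DX, y:y+DY, z:z+DZ]
                if sample_ok(occ, lab):
                    samples.append((occ, lab[DX // 2, DY // 2, :]))  # center-column labels
    return samples

occ = (np.random.rand(124, 124, 62) > 0.97).astype(np.uint8)    # toy scene grid
lab = np.random.randint(0, 21, size=occ.shape) * occ
print(len(extract_samples(occ, lab)), "valid training subvolumes")
```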

Network and training. For the semantic voxel labeling task, we propose a network which predicts class labels for a column of voxels in a scene according to the occupancy characteristics of the voxels’ neighborhood. In order to infer labels for an entire scene, we use the network to predict a label for every voxel column at test time (i.e., every xy position that has voxels on the surface). The network takes as input a 2×31×31×62 volume and uses a series of fully convolutional layers to simultaneously predict class scores for the center column of 62 voxels. We use ReLU and batch normalization for all layers (except the last) in the network. To account for the unbalanced training data over the class labels, we weight the cross entropy loss with the inverse log of the histogram of the train data.
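
The exact layer configuration is given in the paper's supplemental material; the fully-convolutional PyTorch sketch below only follows the constraints stated here (a 2×31×31×62 input, ReLU and batch normalization, per-column scores for 62 voxels, frequency-weighted cross entropy), and its layer sizes are assumptions.

```python
# Sketch of a fully-convolutional network in the spirit described above: it maps a
# 2 x 31 x 31 x 62 neighborhood to class scores for the 62 voxels of the center column.
import torch
import torch.nn as nn

NUM_CLASSES = 21   # 20 object classes + free space

class VoxelColumnNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout, stride_hw):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=(3, 5, 5),
                          stride=(1, stride_hw, stride_hw), padding=(1, 0, 0)),
                nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
        self.net = nn.Sequential(
            block(2, 32, 2),      # xy footprint 31 -> 14, column depth 62 preserved
            block(32, 64, 2),     # 14 -> 5
            block(64, 128, 1),    # 5 -> 1
            nn.Conv3d(128, NUM_CLASSES, kernel_size=1))   # per-voxel class scores

    def forward(self, x):                             # x: (B, 2, 62, 31, 31)
        return self.net(x).squeeze(-1).squeeze(-1)    # (B, 21, 62)

model = VoxelColumnNet()
# Weight the cross entropy by the inverse log of the class histogram of the train data;
# log(1 + hist) is used here only to guard against tiny placeholder counts.
hist = torch.rand(NUM_CLASSES) * 1e6 + 10.0           # placeholder class counts
criterion = nn.CrossEntropyLoss(weight=1.0 / torch.log(1.0 + hist))

x = torch.zeros(4, 2, 62, 31, 31)                     # dummy batch
target = torch.randint(0, NUM_CLASSES, (4, 62))       # labels for each center-column voxel
loss = criterion(model(x), target)                    # CE accepts (B, C, d) scores with (B, d) targets
```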

We use an SGD solver with learning rate 0.01 and momentum 0.9, decaying the learning rate by half every 20 epochs, and train the model for 100 epochs.
Quantitative Results. The goal of this task is to predict semantic labels for all visible surface voxels in a given 3D scene; i.e., every voxel on a visible surface receives one of the 20 object class labels. We use NYU2 labels, and list voxel classification results on ScanNet in Table 7. We achieve a voxel classification accuracy of 73.0% over the set of 312 test scenes, which is based purely on the geometric input (no color is used).
In Table 5, we show our semantic voxel labeling results on the NYU2 dataset [58]. We are able to outperform previous methods which are trained on limited sets of real-world data using our volumetric classification network. For instance, Hermans et al. [31] classify RGB-D frames using a dense random decision forest in combination with a conditional random field. Additionally, SemanticFusion [54] uses a deep net trained on RGB-D frames, and regularizes the predictions with a CRF over a 3D reconstruction of the frames; note that we compare to their classification results before the CRF regularization. SceneNet trains on a large synthetic dataset and fine-tunes on NYU2. Note that in contrast to Hermans et al. and SemanticFusion, neither we nor SceneNet use RGB information.

Note that we do not explicitly enforce prediction consistency between neighboring voxel columns when the test volume is slid across the xy plane. This could be achieved with a volumetric CRF [64], as used in [86]; however, our goal in this task is to focus exclusively on the per-voxel classification accuracy.
5.3. 3D Object Retrieval
Another important task is retrieval of similar CAD models given (potentially partial) RGB-D scans. To this end, one wants to learn a shape embedding where a feature descriptor defines geometric similarity between shapes. The core idea is to train a network on a shape classification task, where a shape embedding can be learned as a byproduct of the classification task. For instance, Wu et al. [91] and Qi et al. [66] use this technique to perform shape retrieval queries within the ShapeNet database. With ScanNet, we have established category-level correspondences between real-world objects and ShapeNet models. This allows us to train on a classification problem where real and synthetic data are mixed within shared class labels.
Thus, we can learn an embedding between real and synthetic data in order to perform model retrieval for RGB-D scans. To this end, we use the volumetric shape classification network by Qi et al. [66], with the same training procedure as in Sec. 5.1. Nearest neighbors are retrieved based on the ℓ2 distance between the extracted feature descriptors, and measured against the ground truth provided by the CAD model retrieval task. In Table 6, we show object retrieval results using objects from ScanNet to query for nearest neighbor models from ShapeNetCore. Note that training on ShapeNet and ScanNet independently results in poor retrieval performance, as neither is able to bridge the gap between the differing characteristics of synthetic and real-world data. Training on both ShapeNet and ScanNet together is able to find an embedding of shape similarities between both data modalities, resulting in much higher retrieval accuracy.
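
Conceptually, the retrieval step reduces to a nearest-neighbor search in the learned embedding space; the sketch below illustrates this with random descriptors standing in for the network features. Array shapes and identifiers are placeholders.

```python
# Sketch of the retrieval step described above: embed scanned objects and CAD
# models with the same classification network (its penultimate features), then
# retrieve nearest ShapeNet models by L2 distance. The feature extractor is stubbed.
import numpy as np

def embed(volumes, feature_extractor):
    """Run the (shared) classification network and keep the feature descriptors."""
    return np.stack([feature_extractor(v) for v in volumes])

def retrieve(query_feat, cad_feats, cad_ids, k=5):
    """k nearest CAD models under the L2 distance in embedding space."""
    d = np.linalg.norm(cad_feats - query_feat[None, :], axis=1)
    order = np.argsort(d)[:k]
    return [(cad_ids[i], float(d[i])) for i in order]

# Toy illustration with random descriptors standing in for network features
rng = np.random.default_rng(0)
cad_feats = rng.normal(size=(1000, 128))          # ShapeNetCore model embeddings
cad_ids = [f"shapenet_model_{i}" for i in range(1000)]
scan_feat = rng.normal(size=128)                  # embedding of a ScanNet object
print(retrieve(scan_feat, cad_feats, cad_ids, k=3))
```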

6. Conclusion

This paper introduces ScanNet: a large-scale RGB-D dataset of 1513 scans with surface reconstructions, instance-level object category annotations, and 3D CAD model placements. To make the collection of this data possible, we designed a scalable RGB-D acquisition and semantic annotation framework that we provide for the benefit of the community. We demonstrated that the richly-annotated scan data collected so far in ScanNet is useful in achieving state-of-the-art performance on several 3D scene understanding tasks; we hope that ScanNet will inspire future work on many other tasks.
