Paper Reading: "3D Object Detection Method Based on YOLO and K-Means for Image and Point Clouds"


Paper link:

http://cn.arxiv.org/abs/2005.02132v1

Abstract

Lidar-based 3D object detection and classification tasks are essential for autonomous driving (AD). A lidar sensor can provide a 3D point cloud reconstruction of the surrounding environment. However, real-time detection in 3D point clouds still requires strong algorithms.
This paper proposes a 3D object detection method based on point clouds and images that consists of three parts: (1) lidar-camera calibration and undistorted image transformation, (2) YOLO-based detection and point cloud extraction, and (3) k-means-based point cloud segmentation, with the detection experiments tested and evaluated on depth images.
In our research, the camera captures images for real-time 2D object detection using YOLO; the resulting bounding boxes are passed to a node that performs 3D object detection on the point cloud data from the lidar. By checking whether the 2D coordinate projected from each 3D point lies inside an object bounding box, high-speed 3D object recognition can be achieved on the GPU.
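To make this projection test concrete, here is a minimal sketch of the check described above, assuming a 3x4 camera projection matrix `P` (built from the calibration parameters) and YOLO boxes given as `(x_min, y_min, x_max, y_max)`; all names are illustrative, not from the paper's code.

```python
import numpy as np

def points_in_bbox(points_xyz, P, bbox):
    """Project Nx3 lidar points with a 3x4 projection matrix P and return a
    boolean mask of the points whose 2D projection falls inside the
    (x_min, y_min, x_max, y_max) bounding box."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])  # homogeneous Nx4
    uvw = pts_h @ P.T                                 # Nx3 scaled pixel coordinates
    in_front = uvw[:, 2] > 0                          # keep points in front of the camera
    w = np.where(np.abs(uvw[:, 2]) < 1e-9, 1e-9, uvw[:, 2])
    u, v = uvw[:, 0] / w, uvw[:, 1] / w
    x_min, y_min, x_max, y_max = bbox
    return in_front & (u >= x_min) & (u <= x_max) & (v >= y_min) & (v <= y_max)
```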
Accuracy and precision improve after k-means clustering of the point cloud, and the speed of our detection method is an advantage: it is faster than PointNet.

I. INTRODUCTION

Great progress has been made on 2D image understanding tasks, such as object detection and instance segmentation [1]. However, beyond 2D bounding boxes or pixel masks, real-time detection on 3D point cloud data is becoming increasingly important in many application areas, such as autonomous driving (AD) and augmented reality (AR). This paper presents our experiments on 3D object detection, one of the most important tasks in 3D computer vision. Also presented are an analysis of the experimental results in terms of precision, accuracy, recall, and time, and possible future work to improve the 3D object detection average precision (AP).
In the AD field, the LIDAR sensor is the most common 3D sensor. It generates 3D point clouds and captures the 3D structure of scenes. The difficulty of point cloud-based 3D object detection mainly lies in the irregularity of the point clouds from LIDAR sensors [2]. Thus, state-of-the-art 3D object detection methods often leverage a mature 2D detection framework by projecting the point clouds into a bird's-eye view or a frontal view [2]. However, information in the point cloud is lost during this quantization process. Charles et al. at Stanford University published a paper at CVPR 2017 proposing a deep learning network called PointNet that handles point clouds directly.
This paper was a milestone, marking the entry of point cloud processing into a new stage. Before PointNet, there was no way to deal with point clouds directly: point clouds are irregular and unordered, so many ordinary algorithms, including standard deep neural networks, do not work on them as-is. Thus, researchers came up with a variety of workarounds [1][2][3], such as flattening the point clouds into pictures (MVCNN) or dividing the point clouds into voxels, then into nodes, and serializing them in order. Thanks to this development, the point cloud domain has advanced from the pre-PointNet era to the post-PointNet era. After PointNet came PointCNN, SO-Net, and others, and the performance of these methods improved steadily.
PointNet [3] has achieved 83.7 percent mean accuracy. However, speed is still a problem. Compared with two-dimensional data, point cloud data with their additional dimension are too large to meet the requirements for real-time 3D object detection. This paper presents our method of extracting every point that may belong to an object after transformation into a 2D bounding box, enabling high-speed 3D object detection. First, we describe a device we constructed comprising six cameras and one LIDAR. Then, we present experiments showing how 2D images are captured by the cameras and how the 3D point clouds from the LIDAR are stored in a rosbag, which is reused in subsequent experiments. The raw image data were distorted, so an undistortion transform step was needed. After that step, five images are split off and fed into the you only look once (YOLO) detection process, which returns the related bounding boxes and class labels. We store the bounding boxes and class labels for later use, reducing coupling within the project. In the second step, a function is written to extract the different topic information from the rosbag. After extracting the point cloud file, we load it into a numpy matrix for subsequent operations.
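A minimal sketch of this extraction step, assuming the standard ROS1 `rosbag` Python API and a `sensor_msgs/PointCloud2` stream; the topic name is illustrative:

```python
import numpy as np
import rosbag
from sensor_msgs import point_cloud2

def extract_clouds(bag_path, topic='/velodyne_points'):
    """Read every PointCloud2 message on `topic` from a rosbag and yield
    one Nx3 numpy array of (x, y, z) per lidar frame."""
    with rosbag.Bag(bag_path) as bag:
        for _, msg, _ in bag.read_messages(topics=[topic]):
            pts = point_cloud2.read_points(msg, field_names=('x', 'y', 'z'),
                                           skip_nans=True)
            yield np.array(list(pts), dtype=np.float32)
```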
Data conversion based on the extrinsic and intrinsic parameters is performed for every point cloud, matching each cloud to its corresponding 2D images by aligning the frame rates of the cameras and the LIDAR. For every bounding box of each point cloud, we collect all the matching points and render them in different colors according to their class labels. Finally, unsupervised clustering of the point clouds in the different bounding boxes improves the detection performance by removing some of the noise. The results of 3D object recognition are presented at the end of the paper; the recognition results were saved in a rosbag, and 3D visualization was then performed to check the experimental results.
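Combining the calibration parameters with the projection test above, a hedged sketch of this point-coloring step might look as follows; `K`, `R`, and `t` come from calibration, `points_in_bbox` is the sketch shown earlier, and the color table is illustrative:

```python
import numpy as np

# Illustrative class colors (RGB); the real labels come from YOLO detection.
CLASS_COLORS = {'car': (255, 0, 0), 'person': (0, 255, 0), 'truck': (0, 0, 255)}

def color_cloud(points_xyz, K, R, t, detections):
    """Assign an RGB color to every point that projects into a detection.
    K is the 3x3 intrinsic matrix; (R, t) are the lidar-to-camera extrinsics;
    detections is a list of ((x_min, y_min, x_max, y_max), label) pairs."""
    P = K @ np.hstack([R, t.reshape(3, 1)])         # 3x4 projection matrix
    colors = np.zeros((points_xyz.shape[0], 3), dtype=np.uint8)
    for bbox, label in detections:
        mask = points_in_bbox(points_xyz, P, bbox)  # reuse the earlier sketch
        colors[mask] = CLASS_COLORS.get(label, (255, 255, 255))
    return colors
```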
The evaluation of the recognition experiment was completed using depth images: the point cloud detection results were transferred to 32×1024 depth images. The final evaluation experiment compared every pixel against the ground truth. The above is the rough research process of this article.
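A hedged sketch of such a point-cloud-to-depth-image conversion via spherical projection; the 32 rows match the HDL-32e's 32 laser channels, and the vertical field-of-view limits are taken from the HDL-32e spec sheet rather than from the paper:

```python
import numpy as np

def cloud_to_depth_image(points_xyz, rows=32, cols=1024,
                         fov_up=10.67, fov_down=-30.67):
    """Spherically project an Nx3 cloud onto a rows x cols depth image.
    Each cell stores the range of the last point that landed in it."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    depth = np.linalg.norm(points_xyz, axis=1)
    yaw = np.arctan2(y, x)                          # horizontal angle in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-6), -1.0, 1.0))
    col = ((0.5 * (1.0 - yaw / np.pi)) * cols).astype(int) % cols
    fov = np.radians(fov_up - fov_down)
    row = ((np.radians(fov_up) - pitch) / fov * rows).astype(int)
    img = np.zeros((rows, cols), dtype=np.float32)
    valid = (row >= 0) & (row < rows)
    img[row[valid], col[valid]] = depth[valid]
    return img
```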

II. 3D OBJECT DETECTION METHOD

A. Overview

Fig. 1 shows an overview of the proposed system. This research was divided into six parts. The first part mainly focused on the calibration of the cameras and the structural design of the testing equipment. The second part converted the distorted images into undistorted ones. The third part was YOLOv3 object recognition on 2D images; we mainly applied the YOLOv1-tiny and YOLOv3 methods in the experiments, using keras to reproduce YOLO. The fourth part was the extraction of point clouds; we used rosbag to store the data and RVIZ for point cloud visualization. The fifth part was unsupervised k-means clustering, which was used to optimize the detection results of the basic experiments and improve the detection accuracy of 3D object recognition. The sixth part was the evaluation of the prediction results in depth images.


B. Lidar-camera calibration

Here is the main information on the equipment used in this experiment and the extrinsic parameters of the cameras. This experiment used a Velodyne lidar (HDL-32e) with omnidirectional cameras (PointGrey Ladybug5) to achieve 360° monitoring with no blind spots, as shown in Fig. 2.
A geometric model of camera imaging must be established during image measurement and machine vision applications to determine the relationship between the three-dimensional position of a point on the surface of an object in space and its corresponding point in the image. These geometric model parameters are the camera parameters. Under most conditions, they must be obtained through experiments and calculation [4]; this process of solving for the parameters is called camera calibration. The intrinsic parameters of the five cameras obtained in this study are shown in Fig. 3.
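For reference, this is one standard way to obtain such intrinsics with OpenCV's checkerboard calibration; the pattern size and image paths are assumptions, and a full model of an omnidirectional rig like the Ladybug5 would need a fisheye model rather than this plain pinhole sketch:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                    # inner-corner grid (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob('calib/*.png'):               # illustrative image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Solve for the intrinsic matrix K and the distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print('K =\n', K, '\ndistortion =', dist.ravel())
```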

C. Image undistorted transform

In photography, wide-angle lenses are generally prone to barrel distortion, while telephoto lenses are prone to pincushion distortion. If a camera uses a short-focal-length wide-angle lens, the resulting image is more susceptible to barrel distortion, because the magnification of the lens gradually decreases with distance, causing image pixels to be displaced radially around the center point. Fig. 4 shows raw images, and Fig. 5 shows images after the undistortion transform, using OpenCV for image correction and camera calibration.
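With the `K` and `dist` from the calibration sketch above, the OpenCV undistortion step could look like this (filenames are illustrative); `alpha=1` keeps all source pixels, which is what produces the black borders mentioned below:

```python
import cv2

raw = cv2.imread('frame.png')                       # one distorted camera frame
h, w = raw.shape[:2]
# Compute a new camera matrix that keeps every source pixel (alpha=1).
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=1)
undistorted = cv2.undistort(raw, K, dist, None, new_K)
cv2.imwrite('frame_undistorted.png', undistorted)
```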

D. YOLO-based detection

YOLO is a fast target detection algorithm that is very useful for tasks with very high real-time requirements. The YOLO authors released YOLOv3 in 2018. Trained on a Titan X, v3 achieves a mean average precision (mAP) comparable to RetinaNet while running 3.8 times faster, and it can process a 320×320 image in 22 ms. Its score of 51.5 is comparable to the accuracy of the single shot detector (SSD), but it is three times faster. Thus, YOLOv3 is both very fast and accurate. At IoU = 0.5, it is equivalent to the mAP of Focal Loss, but four times faster. We used YOLOv3 as the 2D object detection algorithm. Fig. 6 shows an example from a camera image, and Fig. 7 shows an example of 3D object detection from a point cloud.
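The paper reproduces YOLO in keras; as a hedged stand-in, the same inference can be sketched with OpenCV's dnn module and the official Darknet config/weights files (file names and threshold are assumptions):

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet('yolov3.cfg', 'yolov3.weights')

def detect(image, conf_thresh=0.5):
    """Return [(x_min, y_min, x_max, y_max, class_id), ...] for one image."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (320, 320),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for out in net.forward(net.getUnconnectedOutLayersNames()):
        for det in out:           # det = [cx, cy, bw, bh, objectness, 80 class scores]
            scores = det[5:]
            cls = int(np.argmax(scores))
            if scores[cls] > conf_thresh:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append((cx - bw / 2, cy - bh / 2,
                              cx + bw / 2, cy + bh / 2, cls))
    return boxes
```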
All 80 classes of the COCO dataset can be identified, including people, bicycles, cars, motorbikes, airplanes, buses, trains, trucks, boats, traffic lights, fire hydrants, and stop signs. However, the most frequent classes here are trucks, people, and cars, so training a new YOLOv3 network to detect only these three classes may help save detection time and enable real-time operation.
Because large areas of the undistorted images are black, we must remove detections whose bounding box exceeds a maximum size: any bounding box covering more than a quarter of the image inevitably contains black regions, and such a detection is invariably noise. Fig. 6 shows YOLO detection examples from the camera images.
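This size filter reduces to a few lines; a sketch using the box format from the detection example above:

```python
def drop_oversized_boxes(boxes, img_w, img_h, max_fraction=0.25):
    """Discard boxes covering more than max_fraction of the image area,
    since they inevitably overlap the black border of the undistorted frame."""
    return [(x0, y0, x1, y1, cls) for (x0, y0, x1, y1, cls) in boxes
            if (x1 - x0) * (y1 - y0) <= max_fraction * img_w * img_h]
```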

E. Point cloud extraction

We mainly used rosbag to read and output the data. The data read in consisted of the undistorted images and the point clouds; the output was mainly the point cloud detection results.

F. K-means based point cloud segmentation

Because the 2D result is converted into 3D, all points that can be mapped into a bounding box in a certain direction are marked with different label colors. To improve the experiment, we applied unsupervised k-means clustering from machine learning. Detection became faster, and the accuracy of the points improved substantially, removing most of the noise points. The k-means clustering graph is shown in Fig. 8, and 3D object detection examples using this method are shown in Fig. 7.
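A hedged sketch of this per-box denoising with scikit-learn; the paper does not state how the clusters are used, so keeping only the largest cluster and the choice of `n_clusters` are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def denoise_bbox_points(points_xyz, n_clusters=3):
    """Cluster the 3D points that fell inside one bounding box and keep
    only the largest cluster, treating the smaller clusters as noise."""
    if points_xyz.shape[0] < n_clusters:
        return points_xyz
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points_xyz)
    largest = np.bincount(labels).argmax()
    return points_xyz[labels == largest]
```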

G. Evaluation of prediction results in depth images

To create a comparison with the ground truth and make the results easier to observe, we converted the point clouds into a 32×1024 panoramic depth image, shown in Fig. 9, with different colors corresponding to different categories. The depth image generation properties are shown in Table I. In this step, roughly one hundred (0.1K) hand-made ground truth images were created using the LabelMe annotation tool, and the final evaluation experiment included these 0.1K pictures.

III. EXPERIMENTS AND RESULTS

An in-vehicle sensor was used, and a large amount of data was collected. The identification experiment consisted of two parts. The first included visualization and quantity statistics of the point cloud identification with and without k-means clustering, after which the experimental results were evaluated. The second part covered the evaluation criteria: accuracy, precision, and recall, computed by comparing the results, after conversion to depth images, against the ground truth.

A. Experiments with different classes

In this study, we first produced 3D prediction results by direct conversion. Essentially, every point that could be mapped into a bounding box on the image was recognized. This introduced a lot of noise and made it impossible to identify a tight 3D bounding box, but it did allow recognizing in the 3D LIDAR data that an object of a specific category exists in a certain direction, thereby achieving a rough 3D recognition function. A 2D YOLO example is shown in Fig. 10, and the related experiment results with and without k-means are shown in Fig. 11.
Verification here was done mainly by eye: based on the specific bounding box data and the corresponding camera data, we judged whether or not the recognition results were correct.

B. Experiments with k-means

K-means clustering is a method of vector quantization, originally from signal processing, which is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells [4].
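Formally (a standard statement of the objective, not taken from the paper), k-means seeks the partition S = {S_1, ..., S_k} that minimizes the within-cluster variance:

```latex
\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2,
\qquad \boldsymbol{\mu}_i = \frac{1}{|S_i|} \sum_{\mathbf{x} \in S_i} \mathbf{x}
```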
Because k-means fixes the number of clusters and the maximum number of iterations in advance, and the initial center points are chosen randomly, the result of each clustering run is somewhat biased. However, the expected value of each category did not change much by the end of the experiment.
After using YOLO to complete the 2D object recognition and converting the data into 3D point cloud data, we mainly used k-means to further cluster the points already collected for each bounding box, removing some noise and making the recognition results more accurate.

C. Results with and without unsupervised learning

The results of the experiment with and without k-means are shown in Fig. 11. The total number of point clouds in each group was 46,464. The maximum number of points was 4989. This occurred when a car or truck got very close. The lowest number was 0, which means that no target object could be recognized around that time. In other words, it was an empty space with no pre-trained class label objects.
Fig. 12 and Fig. 13 show the ratio and number of data points dropped after clustering the point cloud dataset with k-means. The highest ratio, 49.15 percent, occurred at frame 46; the lowest, only 0.85 percent, occurred at frame 21, when no objects were recognized.
Without clustering, the experimental results colored every point that mapped into a bounding box in a given direction. After clustering, if an object was present, the number of point cloud points in that region increased significantly, while places with no object showed a noticeable decrease in point count. Unsupervised clustering therefore changed the results significantly, removing 31.6 percent of the point cloud on average.

D. Evaluation of prediction results in depth images

We made the 0.1K ground truth through labeling, reading, and output operations in the LabelMe software. In the previous stage, we obtained the prediction results for the point clouds. After converting the point clouds to the same 32×1024 depth images, 32×1024×1 files containing the object category labels were saved as well. In the final evaluation, accuracy, precision, and recall were calculated by comparing the ground truth against the numpy file holding the experimental prediction results. Fig. 14 shows a depth image converted from point cloud data. The ground truth is shown in Fig. 17, the detection results in Fig. 16, and the final results after k-means clustering in Fig. 17. The accuracy and consumed time for the total prediction process are shown in Table II. Accuracy, precision, and recall with and without k-means clustering were recorded; the results shown in Table III reveal that YOLO's 2D detection was successful.
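A minimal sketch of that pixel-wise comparison, assuming the prediction and ground truth are equally shaped arrays of class IDs (e.g. 32×1024):

```python
import numpy as np

def pixel_metrics(pred, gt, cls):
    """Per-class pixel-wise accuracy, precision, and recall between a
    predicted label image and a ground-truth label image."""
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    tn = np.sum((pred != cls) & (gt != cls))
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```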

IV. CONCLUSIONS

The conclusions of our study are as follows:
1. The method adopted in this paper directly links the 3D point cloud to 2D image data, from recognition of the 2D bounding box to coloring of the 3D point cloud. Since the YOLO algorithm is adopted, real-time performance is very strong, and because unsupervised clustering is also used, a lot of noise is removed, which makes the recognition better.
2. This paper mainly seeks a way to determine quickly and accurately whether objects exist in a certain direction. This will contribute to the success of the driverless field, allowing the car to obtain more information and make better judgments.
3. In the final experiments, using two 1080Ti GPUs, the pipeline without clustering consumes 0.19 seconds per frame, and 0.192 seconds per frame with k-means clustering across 5 threads. This fast identification process ensures real-time detection of the surrounding conditions in unmanned driving. If parallel or distributed computing techniques are used, recognition will be even faster.
4. The speed is very fast. However, the accuracy is not very high, since it is bounded by the accuracy of the upstream YOLO recognition, and the detection recall is not high either.

V. FUTURE WORK

In the future, robots will be added, and semantic mapping from running mobile robots will form the core of the next step. We will then consider not only the k-means function but also methods that handle point clouds directly, such as PointNet and FCN, as well as further clustering methods such as point-cloud-based depth clustering, to find a faster way to complete 3D object detection using images and lidar. We will also develop automatic labeling functions based on our method for generating training data for LIDAR-based 3D objects.

ACKNOWLEDGMENT

We would like to thank Wonjik Kim and Ryusei Hasegawa for providing us with their conversion tools, which transferred the point cloud data to 32×1024 depth images.

