Moving Obstacle Detection in Highly Dynamic Scenes

A. Ess, B. Leibe, K. Schindler, and L. van Gool

Abstract— We address the problem of vision-based multi-person tracking in busy pedestrian zones using a stereo rig mounted on a mobile platform. Specifically, we are interested in the application of such a system for supporting path planning algorithms in the avoidance of dynamic obstacles. The complexity of the problem calls for an integrated solution, which extracts as much visual information as possible and combines it through cognitive feedback. We propose such an approach, which jointly estimates camera position, stereo depth, object detections, and trajectories based only on visual information. The interplay between these components is represented in a graphical model. For each frame, we first estimate the ground surface together with a set of object detections. Based on these results, we then address object interactions and estimate trajectories. Finally, we employ the tracking results to predict future motion for dynamic objects and fuse this information with a static occupancy map estimated from dense stereo. The approach is experimentally evaluated on several long and challenging video sequences from busy inner-city locations recorded with different mobile setups. The results show that the proposed integration makes stable tracking and motion prediction possible, and thereby enables path planning in complex and highly dynamic scenes.

I. INTRODUCTION

For reliable autonomous navigation, a robot or car requires appropriate information about both its static and dynamic environment. While remarkable successes have been achieved in relatively clean highway traffic situations [3] and other largely pedestrian-free scenarios such as the DARPA Urban Challenge [6], highly dynamic situations in busy city centers still pose considerable challenges for state-of-the-art approaches.

For successful path planning in such scenarios where multiple independent motions and frequent partial occlusions abound, it is vital to extract semantic information about individual scene objects. Consider for example the scene depicted in the top left corner of Fig. 1. When just using depth information from stereo or LIDAR, an occupancy map would suggest little free space for driving (bottom left). However, as can be seen in the top right image (taken one second later), the pedestrians free up their occupied space soon after, which would thus allow a robotic platform to pass through without unnecessary and possibly expensive replanning. The difficulty is to correctly assess such situations in complex real-world settings, detect each individual scene object, predict its motion, and infer a dynamic obstacle map from the estimation results (bottom right). This task is made challenging by the extreme degree of clutter, appearance variability, abrupt motion changes, and the large number of independent actors in such scenarios.

Fig. 1.  A static occupancy map (bottom left) can erroneously suggest no free space for navigation, even though space is actually freed up a second later (top right). By using the semantic information from an appearance-based multi-person tracker, we can cast predictions about each tracked person’s future motion. The resulting dynamic obstacle map (bottom right) correctly shows sufficient free space, as the persons walk on along their paths.

In this paper, we propose a purely vision-based approach to address this task. Our proposed system uses as input the video streams from a synchronized, forward-looking camera pair. To analyze this data, the system combines visual object detection and tracking capabilities with continuous self-localization by visual odometry and with 3D mapping based on stereo depth. Its results can be used directly as additional input for existing path planning algorithms to support dynamic obstacles. Key steps of our approach are the use of a state-of-the-art object recognition approach for identifying an obstacle’s category, as well as the reliance on a robust multi-hypothesis tracking framework employing model selection to handle the complex data association problems that arise in crowded scenes. This allows our system to apply category-specific motion models for robust tracking and prediction.

In order to cope with the challenges of real-world operation, we additionally introduce numerous couplings and feedback paths between the different components of our system. Thus, we jointly estimate the ground surface and supporting object detections and let both steps benefit from each other. The resulting detections are transferred into world coordinates with the help of visual odometry and are grouped into candidate trajectories by the tracker. Successful tracks are then again fed back to stabilize visual odometry and depth computation through their motion predictions. Finally, the results are combined in a dynamic occupancy map such as the one shown in Fig. 1 (bottom right), which allows free space computation for a later navigation module.

The main contribution of this paper is to show that vision based sensing has progressed sufficiently for such a system to become realizable. Specifically, we focus on tracking-by-detection of pedestrians in busy inner-city scenes, as this is an especially difficult but very important application area of future robotic and automotive systems. Our focus on vision alone does not preclude the use of other sensors such as LIDAR or GPS/INS—in any practical robotic system those sensors have their well-deserved place, and their integration can be expected to further improve performance. However, the richness of visual input makes it possible to infer very detailed semantic information about the target scene, and the relatively low sensor weight and cost make vision attractive for many applications.

The paper is structured as follows: the upcoming section reviews previous work. Section III then gives an overview of the different components of our vision system with a focus on pedestrian tracking, before Section IV discusses its application to the generation of dynamic occupancy maps. Implementation details are given in Section V. Finally, we present experimental results on challenging urban scenarios in Section VI, before the paper is concluded in Section VII.

II. RELATED WORK

Obstacle avoidance is one of the central capabilities of any autonomous mobile system. Many systems are building up occupancy maps [7] for this purpose. An exhaustive review can be found in [28]. While such techniques are geared towards static obstacles, a main challenge is to accurately detect moving objects in the scene. Such objects can be extracted independent of their category by modeling the shape of the road surface and treating everything that does not fit that model as an object (e.g. in [19], [26], [33]). However, such simple approaches break down in crowded situations where not enough of the ground may be visible. More accurate detections can be obtained by applying category-specific models, either directly on the camera images [5], [16], [25], [31], on the 3D depth information [1] or both in combination [9], [12], [27].

Tracking detected objects over time presents additional challenges due to the complexity of data association in crowded scenes. Targets are typically followed using classic tracking approaches such as Extended Kalman Filters (EKF), where data assignment is optimized using Multi-Hypothesis Tracking (MHT) [4], [22] or Joint Probabilistic Data Association Filters (JPDAF) [11]. Several robust approaches have been proposed based on those components, either operating on depth measurements [23], [24], [29] or as tracking-by-detection approaches from purely visual input [13], [17], [31], [32]. The approach employed in this paper is based on our previous work [17]. It works online and simultaneously optimizes detection and trajectory estimation for multiple interacting objects over long time windows, by operating in a hypothesis selection framework.

III. SYSTEM

Our vision system is designed for a mobile platform equipped with a pair of forward-looking cameras. Altogether, we report experimental results for three different such platforms, shown in Fig. 2. In this paper, we only use visual appearance and stereo depth, and integrate different components for ground plane and ego-motion estimation, object detection, tracking, and occupied area prediction.

Fig. 2. Mobile recording platforms used in our experiments. Note that in this paper we only employ image information from a stereo camera pair and do not make use of other sensors such as GPS or LIDAR.

Fig. 3(a) gives an overview of the proposed vision system. For each frame, the blocks are executed as follows. First, a depth map is calculated and the new frame’s camera pose is predicted. Then objects are detected together with the supporting ground surface, taking advantage of appearance, depth, and previous trajectories. The output of this stage, along with predictions from the tracker, helps stabilize visual odometry, which updates the pose estimate for the platform and the detections, before running the tracker on these updated detections. As a final step, we use the estimated trajectories in order to predict the future locations for dynamic objects and fuse this information with a static occupancy map. The whole system is held entirely causal, i.e. at any point in time it only uses information from the past and present.

Fig. 3. (a) Flow diagram for our vision system. (b) Graphical model for tracking-by-detection with additional depth information (see text for details).

For the basic tracking-by-detection components, we rely on the framework described in [8]. The main contribution of this paper is to extend this framework to the prediction of future spatial occupancy for both static and dynamic objects. The following subsections describe the main system components and give details about their robust implementation.

A. Coupled Object Detection and Ground Plane Estimation

Instead of directly using the output of an object detector for the tracking stage, we introduce scene knowledge to reduce false positives. For this, we assume a simple scene model where all objects of interest reside on a common ground plane. As a wrong estimate of this ground plane has far-reaching consequences for all later stages, we try to avoid making hard decisions here and instead model the coupling between object detections and the scene geometry probabilistically using a Bayesian network (see Fig. 3(b)). This network is constructed for each frame and models the dependencies between object hypotheses o_i, object depths d_i, and the ground plane π, using evidence from the image I, the depth map D, a stereo self-occlusion map O, and the ground-plane evidence π_D in the depth map. Following standard graphical model notation, the plate indicates repetition of the contained parts for the number of objects n.

In this model, an object's probability depends both on its geometric world position and size (expressed by P(o_i|π)), on its correspondence with the depth map P(o_i|d_i), and on P(I|o_i), the object likelihood estimated by the object detector. The likelihood P(π_D|π) of each candidate ground plane is modeled by a robust estimator taking into account the uncertainty of the inlier depth points. The prior P(π), as well as the conditional probability tables, are learned from a training set.

In addition, we introduce temporal dependencies, indicated by the dashed arrows in Fig. 3(b). For the ground plane, we propagate the state from the previous frame as a temporal prior P(π|π_{t-1}) = (1-α)P(π) + αP(π_{t-1}) that stabilizes the per-frame information from the depth map P(π_D|π). For the detections, we add a spatial prior for object locations that are supported by tracked candidate trajectories H_{t0:t-1}. As shown in Fig. 3(b), this dependency is not a first-order Markov chain, but reaches many frames into the past, as a consequence of the tracking framework explained in Section III-B.
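
To make the blending concrete, the sketch below applies this temporal prior on the discretized ground-plane grid described in Section V. This is a minimal illustration; the mixing weight alpha is a placeholder, not a value reported in the paper.

```python
import numpy as np

def temporal_ground_plane_prior(p_prior, p_prev, alpha=0.6):
    """Blend the learned prior P(pi) with the previous frame's estimate:
    P(pi | pi_{t-1}) = (1 - alpha) * P(pi) + alpha * P(pi_{t-1}).

    p_prior, p_prev: probabilities over the discretized plane parameters
    (theta, phi, pi_4), e.g. arrays of shape (6, 6, 20); alpha is illustrative.
    """
    p = (1.0 - alpha) * p_prior + alpha * p_prev
    return p / p.sum()  # renormalize over the discrete grid
```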

The advantage of this Bayesian network formulation is that it can operate in both directions. Given a largely empty scene where depth estimates are certain, the ground plane can significantly constrain object detection. In more crowded situations where less of the ground is visible, on the other hand, the object detector provides sufficient evidence to assist ground plane estimation.

B. Tracking and Prediction

After passing the Bayesian network, object detections are placed into a common world coordinate system using camera positions estimated from visual odometry. The actual tracking system follows a multi-hypothesis approach, similar to the one described in [17]. We do not rely on background modeling, but instead accumulate the detections of the current and past frames in a space-time volume. This volume is analyzed by growing many trajectory hypotheses using independent bi-directional Extended Kalman filters (EKFs) with a holonomic constant-velocity model. While the inclusion of further motion models, as e.g. done in [27], would be possible, it proved to be unnecessary in our case.
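
With a holonomic constant-velocity model the motion equations are linear, so each per-trajectory filter reduces to a standard Kalman predict/update cycle on the ground plane. The sketch below shows this structure; the state layout and noise magnitudes are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def make_cv_model(dt, q=0.5, r=0.3):
    """Constant-velocity model on the ground plane; state [x, y, vx, vy].
    q (process noise) and r (measurement noise) are placeholder levels."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], float)   # constant-velocity transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], float)   # detections observe position only
    return F, H, q * np.eye(4), r * np.eye(2)

def kf_predict(x, P, F, Q):
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

Running kf_predict repeatedly without intermediate updates yields the future positions, with growing covariances, that are later used for the occupancy cones of Section IV.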

By starting EKFs from detections at different time steps, an overcomplete set of trajectories is obtained, which is then pruned to a minimal consistent explanation using model selection. This step simultaneously resolves conflicts from overlapping trajectory hypotheses by letting trajectories compete for detections and space-time volume. In a nutshell, the pruning step employs quadratic pseudo-boolean optimization to pick the set of trajectories with maximal joint probability, given the observed evidence over the past frames. This probability

• increases as the trajectories explain more detections and as they better fit the detections' 3D location and 2D appearance through the individual contribution of each detection;

• decreases when trajectories are (partially) based on the same object detections, through pairwise corrections to the trajectories' joint likelihoods (these express the constraints that each pedestrian can only follow one trajectory and that two pedestrians cannot be at the same location at the same time);

• decreases with the number of required trajectories through a prior favoring explanations with fewer trajectories, balancing the complexity of the explanation against its goodness-of-fit in order to avoid over-fitting ("Occam's razor").

For the mathematical details, we refer to [17]. The most important features of this method are automatic track initialization (usually, after about 5 detections) and the ability to recover from temporary track loss and occlusion.
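
The actual optimization is a quadratic pseudo-boolean problem solved as in [17]; the toy sketch below only mirrors the structure of the objective above, using an exhaustive search over subsets (viable for small candidate sets) and hand-picked scores standing in for the real likelihood terms.

```python
from itertools import combinations

def select_trajectories(hyps, overlap, model_cost=1.0):
    """Pick the subset of trajectory hypotheses with maximal joint score.

    hyps:       list of per-hypothesis scores (support from explained detections).
    overlap:    dict mapping index pairs (i, j) to interaction penalties for
                hypotheses competing for the same detections / space-time volume.
    model_cost: prior cost per selected trajectory ("Occam's razor").
    All numbers here are illustrative stand-ins, not the paper's terms."""
    n = len(hyps)
    best, best_score = (), 0.0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            score = sum(hyps[i] for i in subset) - model_cost * k
            score -= sum(overlap.get((i, j), 0.0)
                         for i, j in combinations(subset, 2))
            if score > best_score:
                best, best_score = subset, score
    return best

# Two strong hypotheses and one weak one; 0 and 2 share detections:
print(select_trajectories([3.0, 2.5, 1.1], {(0, 2): 2.0}))  # -> (0, 1)
```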

The selected trajectories H are then used to provide a spatial prior for object detection in the next frame. This prediction has to take place in the world coordinate system, so tracking critically depends on an accurate and smooth ego-motion estimate.

C. Visual Odometry

To allow reasoning about object trajectories in the world coordinate system, the camera position for each frame is estimated using visual odometry. The employed approach builds upon previous work by [8], [20]. In short, each incoming image is divided into a grid of 10×10 bins, and an approximately uniform number of points is detected in each bin using a Harris corner detector with locally adaptive thresholds. The binning encourages a feature distribution suitable for stable localization. To reduce outliers in RANSAC, we mask out corners that coincide with predicted object locations from the tracker output and are hence not deemed suitable for localization, as shown in Fig. 4.
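
A rough sketch of this feature-detection step follows, using OpenCV's corner detector in Harris mode as a stand-in for the Harris detector with locally adaptive thresholds; the bin count mirrors the 10×10 grid in the text, while the quality settings are placeholders.

```python
import cv2
import numpy as np

def detect_features(gray, track_boxes, per_bin=10, grid=10):
    """Detect an approximately uniform number of corners per grid bin,
    masking out regions predicted to contain tracked (moving) objects."""
    h, w = gray.shape
    mask = np.full((h, w), 255, np.uint8)
    for (x0, y0, x1, y1) in track_boxes:      # predicted object boxes
        mask[y0:y1, x0:x1] = 0                # exclude non-static parts
    points = []
    bh, bw = h // grid, w // grid
    for r in range(grid):
        for c in range(grid):
            sub = np.ascontiguousarray(gray[r * bh:(r + 1) * bh,
                                            c * bw:(c + 1) * bw])
            m = np.ascontiguousarray(mask[r * bh:(r + 1) * bh,
                                          c * bw:(c + 1) * bw])
            corners = cv2.goodFeaturesToTrack(
                sub, maxCorners=per_bin, qualityLevel=0.01,
                minDistance=5, mask=m, useHarrisDetector=True)
            if corners is not None:
                points.append(corners.reshape(-1, 2) + [c * bw, r * bh])
    return np.vstack(points) if points else np.empty((0, 2))
```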

Fig. 4.   Visual odometry and occupancy maps are only based on image parts not explained by tracked objects, i.e. the parts we believe to be static. Left: original image with detected features. Right: image when features on moving objects (green) are ignored.

In the initial frame, stereo matching and triangulation provide a first estimate of the 3D structure. In subsequent frames, we use 3D-2D matching to get correspondences, followed by camera resection (3-point pose) with RANSAC. Old frames (t' < t−15) are discarded, along with points that are only supported by those removed frames. To guarantee robust performance, we introduce an explicit failure detection mechanism based on the covariance of the estimated camera position, as described in [8]. In case of failure, a Kalman filter estimate is used instead of the measurement, and the visual odometry is restarted from scratch. This allows us to keep the object tracker running without resetting it. While such a procedure may introduce a small drift, a locally smooth trajectory is more important for our application. In fact, driftless global localization would require additional input from other sensors such as a GPS.
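
A condensed sketch of the per-frame resection step, using OpenCV's RANSAC pose solver with a P3P minimal solver; the covariance-based failure test is simplified here to an inlier-count check, and pose_filter stands in for the Kalman fallback (both are hypothetical simplifications of what the paper describes).

```python
import cv2
import numpy as np

def estimate_pose(pts3d, pts2d, K, pose_filter, min_inliers=10):
    """pts3d: (N, 3) float32 world points; pts2d: (N, 2) float32 image points;
    K: 3x3 camera matrix. Returns (rvec, tvec) for the current frame."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, None, flags=cv2.SOLVEPNP_P3P)
    if ok and inliers is not None and len(inliers) >= min_inliers:
        pose_filter.update(rvec, tvec)   # measurement update of the pose filter
        return rvec, tvec
    # Failure detected: fall back to the filter prediction; the tracker keeps
    # running while visual odometry is restarted from scratch.
    return pose_filter.predict()
```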

IV. OCCUPANCY MAP AND FREE SPACE PREDICTION

For actual path planning, the construction of a reliable occupancy map is of utmost importance. We split this in two parts according to the static scene and the dynamically moving objects.

For static obstacles, we construct a stochastic occupancy map based on the algorithm from [2]. In short, incoming depth maps are projected onto a polar grid on the ground and are fused with the integrated and transformed map from the previous frames. Based on this, free space for driving can be computed using dynamic programming. While [2] integrate entire depth maps (including any dynamic objects) for the construction of the occupancy map, we opt to filter out these dynamic parts. As in the connection with visual odometry, we use the tracker prediction as well as the current frame's detections to mask out any non-static parts. The reasons for this are twofold: first, integrating non-static objects can result in a smeared occupancy map. Second, we are not only interested in the current position of the dynamic parts, but also in their future locations. For this, we can use accurate and category-specific motion models inferred from the tracker.
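
The projection onto the polar grid can be sketched as follows, assuming the depth points have already been transformed into ground coordinates (x lateral, z forward) and the dynamic parts have been masked out; the grid resolution, normalization, and fusion weight are placeholder choices, not the settings of [2].

```python
import numpy as np

def update_polar_occupancy(prev_map, points_xz, n_angle=64, n_range=48,
                           max_range=30.0, fuse=0.8):
    """Project static scene points onto a polar ground grid and fuse with
    the (already ego-motion-compensated) map from previous frames."""
    occ = np.zeros((n_angle, n_range))
    ang = np.arctan2(points_xz[:, 0], points_xz[:, 1])     # lateral vs. forward
    rng = np.hypot(points_xz[:, 0], points_xz[:, 1])
    a = ((ang + np.pi / 2) / np.pi * n_angle).astype(int)  # front hemisphere
    r = (rng / max_range * n_range).astype(int)
    valid = (a >= 0) & (a < n_angle) & (r >= 0) & (r < n_range)
    np.add.at(occ, (a[valid], r[valid]), 1.0)
    occ = np.minimum(occ / 5.0, 1.0)                       # crude normalization
    return fuse * prev_map + (1.0 - fuse) * occ
```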

Dynamic Obstacles. As each object selected by the tracker is modeled by an independent EKF, we can predict its future position and obtain the corresponding uncertainty C. Choosing a bound on the positional uncertainty then yields an ellipse where the object will reside with a given probability. In our experiments, a value of 99% resulted in a good compromise between safety from collision and the need to leave a navigable path for the robot to follow. For the actual occupancy map, we also have to take into consideration the object’s dimensions and, in case of an anisotropic “footprint”, the bounds for its rotation. We assume pedestrians to have a circular footprint, so the final occupancy cone can be constructed by adding the respective radius to the uncertainty ellipse. In our visualization, we show the entire occupancy cone for the next second, i.e. the volume the pedestrian is likely to occupy within that time.
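
The bound itself follows from the chi-square distribution of the Mahalanobis distance: the sketch below extracts the 99% ellipse from the EKF's 2×2 positional covariance and grows its axes by an assumed circular pedestrian footprint (the radius value is illustrative).

```python
import numpy as np
from scipy.stats import chi2

def occupancy_ellipse(mean_xy, cov_xy, p=0.99, footprint_radius=0.3):
    """Return center, semi-axes, and orientation of the region a tracked
    pedestrian occupies with probability p, grown by its circular footprint."""
    s = chi2.ppf(p, df=2)                    # 99% quantile of chi^2, 2 dof
    eigval, eigvec = np.linalg.eigh(cov_xy)  # principal axes of the covariance
    semi_axes = np.sqrt(s * eigval) + footprint_radius
    angle = np.arctan2(eigvec[1, -1], eigvec[0, -1])  # major-axis orientation
    return mean_xy, semi_axes, angle
```

Sweeping this ellipse along the EKF prediction over the next second traces out the occupancy cone shown in the visualizations.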

Based on this predicted occupancy map, free space for driving can be computed with the same algorithm as in [2], but using an appropriate prediction horizon. Note that in case a person was not tracked successfully, it will still occur in the static occupancy map, as a sort of graceful degradation of the system.

V. DETAILED IMPLEMENTATION

The system’s parameters were trained on a sequence with 490 frames, containing 1’578 annotated pedestrian bounding boxes. In all experiments, we used data recorded at a resolution of 640×480 pixels (bayered) at 13–14 fps, with a camera baseline of 0.4 and 0.6 meters for the child stroller and car setups, respectively.

Ground Plane. For training, we infer the ground plane directly from D using Least-Median-of-Squares (LMedS), with bad estimates discarded manually. Related but less general methods include e.g. the v-disparity analysis [15]. For tractability, the ground plane parameters (θ, φ, π_4) are discretized into a 6×6×20 grid, with bounds inferred from the training sequences. The training sequences also serve to construct the prior distribution P(π).
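
For completeness, a minimal LMedS plane fit: draw random minimal samples of three points and keep the hypothesis with the smallest median squared point-to-plane residual. The trial count is a placeholder.

```python
import numpy as np

def fit_plane_lmeds(points, n_trials=500, rng=np.random.default_rng(0)):
    """LMedS ground-plane fit: plane (n, d) with n.p + d = 0 minimizing the
    median squared point-to-plane distance over random 3-point samples."""
    best, best_med = None, np.inf
    for _ in range(n_trials):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                             # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n.dot(p0)
        med = np.median((points @ n + d) ** 2)   # squared residuals
        if med < best_med:
            best, best_med = (n, d), med
    return best
```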

Object Hypotheses. Our system is independent of a specific detector choice. In the experiments presented here, we use a publicly available detector based on a Histogram-of-Oriented-Gradients representation [5]. The detector is run with a low confidence threshold to retain the necessary flexibility; in the context of the additional evidence we are using, final decisions based only on appearance would be premature. The range of detected scales corresponds to pedestrian heights of 60–400 pixels. The object size distribution is modeled as a Gaussian N(1.7, 0.085²) [m], as in [14]. The depth distribution is assumed uniform in the system's operating range of 0.5–30 [m] (60 [m] for the car setup).
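
Evaluating the size prior is then straightforward: the metric height implied by a detection's scale and the ground-plane geometry is scored under N(1.7, 0.085²), as in this small sketch.

```python
from scipy.stats import norm

size_prior = norm(loc=1.7, scale=0.085)  # pedestrian height prior, in meters

def height_likelihood(est_height_m):
    """Density of a detection's back-projected height under the size prior."""
    return size_prior.pdf(est_height_m)

print(height_likelihood(1.72))  # plausible pedestrian height -> high density
print(height_likelihood(1.10))  # far outside the prior -> near zero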

Depth Cues. The depth map D for each frame is obtained with a publicly available, belief-propagation-based disparity estimation software [10]. All results reported in this paper are based on this algorithm. In the meantime, we have also experimented with a fast GPU-based depth estimator, which seems to achieve similar system-level accuracy. However, we still have to verify those results in practice. For verifying detections by depth measurements in the Bayesian network, we consider the agreement of the measured mean depth inside the detection bounding box with the ground-plane distance to the bounding box foot-point. As the detector’s bounding box placement is not always accurate, we allow the Bayesian network to “wiggle around” the bounding boxes slightly in order to improve goodness of fit. The final classifier for an object’s presence is based on the number of inlier depth points and is learned from training data using logistic regression.
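
A sketch of this depth test: count the depth pixels inside the box that agree with the ground-plane distance of its foot-point, then map the inlier count to a probability with a logistic function. The tolerance and the weights w0, w1 are stand-ins for the values learned from training data.

```python
import numpy as np

def depth_support(depth_map, box, plane_dist, tol=0.5, w0=-4.0, w1=0.15):
    """Probability that a detection is supported by the depth map.
    box = (x0, y0, x1, y1); plane_dist = ground-plane distance to the
    foot-point; tol, w0, w1 are illustrative, not the learned values."""
    x0, y0, x1, y1 = box
    patch = depth_map[y0:y1, x0:x1]
    inliers = np.count_nonzero(np.abs(patch - plane_dist) < tol)
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * inliers)))  # logistic regression
```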

Belief Propagation. The network of Fig. 3 is constructed for each frame, with all variables modeled as discrete entities and their conditional probability tables defined as described above. Inference is conducted using Pearl's Belief Propagation [21]. For efficiency reasons, the set of possible ground planes is pruned to the 20% most promising ones (according to prior and depth information).

VI. RESULTS

In order to evaluate our vision system, we applied it to three test sequences, showing strolls and drives through busy pedestrian zones. The sequences were acquired with the platforms seen in Fig. 2. The first test sequence ("Seq. #1"), recorded with platform (a), shows a walk over a crowded square, extending over 230 frames. The second sequence ("Seq. #2"), recorded with platform (b) at considerably worse image contrast, contains 5'193 pedestrian annotations in 999 frames. The third test sequence ("Seq. #3") consists of 800 frames and was recorded from a car passing through a crowded city center, where it had to stop a few times to let people pass. We annotated pedestrians in every fourth frame, resulting in 960 annotations for this sequence.

For a quantitative evaluation, we measure bounding box overlap in each frame and plot recall over false positives per image for three stages of our system. The results of this experiment are shown in Fig. 5 (left, middle). The plot compares the raw detector output, the intermediate output of the Bayesian network, and the final tracking output. As can be seen, discarding detections that are not in accordance with the scene by means of the Bayesian network greatly reduces false positives with hardly any impact on recall. The tracking stage additionally improves the results and in most cases achieves a higher performance than the raw detector. It should be noted, though, that a single-frame comparison is not entirely fair here, since the tracker requires some detections to initialize (losing recall) and reports tracking results through occlusions (losing precision if the occluded persons are not annotated). However, the tracking stage provides the necessary temporal information that makes the entire motion prediction system possible at all. The blue curves in Fig. 5 show the performance on all annotated pedestrians. When only considering the immediate range up to 15 m distance (which is suitable for a speed of 30 km/h in inner-city scenarios), performance is considerably better, as indicated by the red curves.
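
The per-frame matching uses the standard bounding-box overlap criterion; for reference, a minimal intersection-over-union measure is sketched below (the exact acceptance threshold used in the evaluation is not restated here).

```python
def iou(a, b):
    """Intersection-over-union of axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```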

To assess the suitability of our system for path planning, we investigate the precision of the motion prediction for increasing time horizons. This precision is especially interesting, since it allows us to quantify the possible advantage over systems modeling only static obstacles. Specifically, we compare the bounding boxes obtained from the tracker's prediction with the actual annotations in the frame and count the fraction of false positives (1 − precision). The results can be seen in Fig. 5 (right). As expected, precision drops with increasing lookahead time, but stays within acceptable limits for a prediction horizon ≤ 1 s (12 frames). Note that this plot should only be taken qualitatively: a precision of 0.9 does not imply an erroneous replanning every 10th frame, as many of the predicted locations do not affect the planned path. Rather, this experiment shows that for reasonable prediction horizons, the precision does not drop considerably.

Example tracking results for Seq. #1 are shown in Fig. 6. The operating point for generating those results was the same as the one used in Fig. 5(right). Recorded on a busy city square, many people interact in this scene, moving in all directions, stopping abruptly (e.g. the first orange box), and frequently occluding each other (see e.g. the second orange box). The bounding boxes are color coded to show the tracked identities (due to the limited palette, some color labels repeat). Below each image, we show the inferred dynamic obstacle map in an overhead view. Static obstacles are marked in black; each tracked pedestrian is entered with its current position and the predicted occupancy cone for the next second (for standing pedestrians, this cone reduces to a circle). As can be seen, our system is able to track most of the visible pedestrians correctly and to accurately predict their future motion.

Fig. 7 shows more results for Seq. #2. Note that both adults and children are identified and tracked correctly even though they differ considerably in their appearance. In the bottom row of the figure, a man in pink walks diagonally towards the camera. Without motion prediction, a following navigation module might issue an unnecessary stop here. However, our system correctly determines that he presents no danger of collision and resolves this situation. Also note how the standing woman in the white coat gets integrated into the static occupancy map as soon as she is too large to be detected. This is a safe fallback in the design of our system—when no detections are available, its results simply revert to those of a depth-integration based occupancy map.

Finally, Fig. 8 demonstrates the vision system in a car application. Compared to the previous sequences, the viewpoint is quite different, and faster scene changes result in fewer data points for creating trajectories. Still, stable tracking performance can be obtained also for quite distant pedestrians.

System Performance

Apart from the object detectors, the entire system is implemented in an integrated fashion in C/C++, with several procedures taking advantage of GPU processing. For the complex parts of Seq. #3 (15 simultaneous objects), we can achieve processing times of around 400 ms per frame on an Intel Core2 CPU 6700, 2.66GHz, NVidia GeForce 8800 (see Tab. I). While the detector stage is the current bottleneck (the detector was run offline and needed about 30 seconds per image), we want to point out that for the HOG detector, real-time GPU implementations exist [30], which could be substituted to remove this restriction.

VII. CONCLUSION

In this paper, we have presented a mobile vision system for the creation of dynamic obstacle maps for automotive or mobile robotics platforms. Such maps should provide valuable input for actual path planning algorithms [18]. Our approach relies on a robust tracking system that closely integrates different modules (appearance-based object detection, depth estimation, tracking, and visual odometry). To resolve the complex interactions that occur between pedestrians in urban scenarios, a multi-hypothesis tracking approach is employed. The inferred predictions can then be used to extend a static occupancy map generation system to a dynamic one, which then allows for more detailed path planning. The resulting system can handle very challenging scenes and delivers accurate predictions for many simultaneously tracked objects.

In future work, we plan to optimize the individual system components further with respect to run-time and performance. As discussed before, system operation at 2-3 fps is already reachable now, but additional improvements are necessary for true real-time performance. In addition, we plan to improve the trajectory analysis by including more elaborate motion models and to combine it with other sensing modalities such as GPS and LIDAR.

REFERENCES

[1] K. O. Arras, O. M. Mozos, and W. Burgard. Using boosted features for the detection of people in 2D range data. In ICRA, 2007.

[2] H. Badino, U. Franke, and R. Mester. Free space computation using stochastic occupancy grids and dynamic programming. In ICCV Workshop on Dynamical Vision (WDV), 2007.

[3] M. Betke, E. Haritaoglu, and L. S. Davis. Real-time multiple vehicle tracking from a moving vehicle. MVA, 12(2):69–83, 2000.

[4] I. J. Cox. A review of statistical data association techniques for motion correspondence. IJCV, 10(1):53–66, 1993.

[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[6] DARPA. DARPA urban challenge rulebook, 2008. http://www.darpa.mil/GRANDCHALLENGE/docs/Urban_Challenge_Rules_102707.pdf.

[7] A. Elfes. Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation, 3(3):249–265, 1987.

[8] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In CVPR, 2008.

[9] A. Ess, B. Leibe, and L. van Gool. Depth and appearance for mobile scene analysis. In ICCV, 2007.

[10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. IJCV, 70:41–54, 2006.

[11] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE J. Oceanic Engineering, 8(3):173–184, 1983.

[12] D. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. In ICCV, pages 87–93, 1999.

[13] D. M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41–59, 2007.

[14] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.

[15] R. Labayrade, D. Aubert, and J.-P. Tarel. Real time obstacle detection on non flat road geometry through 'v-disparity' representation. In IVS, 2002.

[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, May 2008.

[17] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled detection and tracking from static cameras and moving vehicles. IEEE TPAMI, 30(10):1683–1698, 2008.

[18] K. Macek, A. D. Vasquez, T. Fraichard, and R. Siegwart. Safe vehicle navigation in dynamic urban scenarios. In ITSC, 2008.

[19] S. Nedevschi, R. Danescu, D. Frentiu, T. Graf, and R. Schmidt. High accuracy stereovision approach for obstacle detection on non-planar roads. In Proc. IEEE Intelligent Engineering Systems, 2004.

[20] D. Nistér, O. Naroditsky, and J. R. Bergen. Visual odometry. In CVPR, 2004.

[21] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers Inc., 1988.

[22] D. B. Reid. An algorithm for tracking multiple targets. IEEE T. Automatic Control, 24(6):843–854, 1979.

[23] M. Scheutz, J. McRaven, and G. Cserey. Fast, reliable, adaptive, bimodal people tracking for indoor environments. In IROS, 2004.

[24] D. Schulz, W. Burgard, D. Fox, and A. Cremers. People tracking with mobile robots using sample-based joint probabilistic data association filters. IJRR, 22(2):99–116, 2003.

[25] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In IVS, 2004.

[26] M. Soga, T. Kato, M. Ohta, and Y. Ninomiya. Pedestrian detection with stereo vision. In IEEE International Conf. on Data Engineering, 2005.

[27] L. Spinello, R. Triebel, and R. Siegwart. Multimodal people detection and tracking in crowded scenes. In Proc. of the AAAI Conference on Artificial Intelligence (Physically Grounded AI Track), July 2008.

[28] S. Thrun. Probabilistic Robotics. The MIT Press, 2005.

[29] C.-C. Wang, C. Thorpe, and S. Thrun. Online simultaneous localization and mapping with detection and tracking of moving objects: Theory and results from a ground vehicle in crowded urban areas. In ICRA, 2003.

[30] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele. Sliding-windows for rapid object class localization: A parallel technique. In DAGM, 2008.

[31] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet part detectors. IJCV, 75(2):247–266, 2007.

[32] L. Zhang, Y. Li, and R. Nevatia. Global data association for multiobject tracking using network flows. In CVPR, 2008.

[33] L. Zhao and C. Thorpe. Stereo- and neural network-based pedestrian detection. In ITS, 2000.
