1. What
An autonomous driving simulator based on NeRF with three features: instance-aware, modular, and realistic.
2. Why
For current autonomous driving algorithms, training on corner cases helps push past their performance bottleneck.
Existing autonomous driving simulation methods, such as CARLA, AADS, and GeoSim, each have their own limitations.
Recently, Neural Scene Graph (NSG) decomposes dynamic scenes into learned scene graphs and learns latent representations for category-level objects. However, its multi-plane-based representation for background modeling cannot synthesize images under large viewpoint changes.
3. How
3.1 Inputs
The input to the system consists of a set of RGB images $\{\mathcal{I}_i\}^N$, sensor poses $\{\mathcal{T}_i\}^N$ (computed from IMU/GPS signals), and object tracklets, including 3D bounding boxes $\{\mathcal{B}_{ij}\}^{N\times M}$, categories $\{\mathrm{type}_{ij}\}^{N\times M}$, and instance IDs $\{\mathrm{idx}_{ij}\}^{N\times M}$. $N$ is the number of input frames and $M$ is the number of tracked instances $\{\mathcal{O}_j\}^M$ across the whole sequence. A set of depth maps $\{\mathcal{D}_i\}^N$ and semantic segmentation masks can optionally be provided.
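For concreteness, the per-frame inputs above could be grouped in a small container like the following (a minimal sketch; `FrameInput` and all field names are assumptions for illustration, not the paper's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameInput:
    # Hypothetical container for one frame's inputs; field names are
    # illustrative, not taken from the paper's code.
    image: list                       # RGB image I_i
    sensor_pose: list                 # 4x4 pose T_i from IMU/GPS
    boxes: list                       # M 3D bounding boxes B_ij
    types: list                       # M category labels type_ij
    ids: list                         # M instance IDs idx_ij
    depth: Optional[list] = None      # optional depth map D_i
    semantics: Optional[list] = None  # optional segmentation mask
```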
3.2 Scene Representation
Architectures: It supports various NeRF backbones, which can be roughly categorized into two classes: MLP-based methods and grid-based methods. This paper gives a formal exposition of grid-based methods (Instant-NGP).
Foreground Nodes: Similar to NSG, it exploits latent codes to encode instance features and shared category-level decoders to encode class-wise priors.
3.3 Compositional Rendering
It uses the standard volume rendering process to render pixel-wise properties:
$$\hat{\mathbf{c}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{c}_i+(1-\mathrm{accum})\cdot\mathbf{c}_{\mathrm{sky}},\qquad T_i=\exp\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)$$

$$\hat{d}(\mathbf{r})=\sum_{P_i}T_i\alpha_i t_i+(1-\mathrm{accum})\cdot\mathrm{inf}$$

$$\hat{\mathbf{s}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{s}_i+(1-\mathrm{accum})\cdot\mathbf{s}_{\mathrm{sky}}.$$
where $P_i\in\mathrm{sorted}(\{P_i^{\mathrm{bg,obj}}\})$, $\alpha_i=1-\exp(-\sigma_i\delta_i)$, $\delta_i=t_{i+1}-t_i$, $\mathrm{accum}=\sum_{P_i}T_i\alpha_i$, $\mathbf{c}_{\mathrm{sky}}$ is the rendered color from the sky model, $\mathrm{inf}$ is the upper-bound distance, and $\mathbf{s}_{\mathrm{sky}}$ is the one-hot semantic logits of the sky category.
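The compositing above can be sketched for a single ray in pure Python (a minimal illustration; `composite_render` and its argument names are assumptions, and the semantic channel is omitted for brevity):

```python
import math

def composite_render(sigmas, deltas, colors, t_vals, sky_color, far):
    """Front-to-back alpha compositing over the sorted (bg + obj) samples
    of one ray, blending the residual transmittance with a sky color and
    an upper-bound depth. Illustrative sketch, not the paper's code."""
    T = 1.0                      # accumulated transmittance T_i
    accum = 0.0                  # sum of T_i * alpha_i (ray opacity)
    c_hat = [0.0, 0.0, 0.0]      # accumulated color
    d_hat = 0.0                  # accumulated depth
    for sigma, delta, c, t in zip(sigmas, deltas, colors, t_vals):
        alpha = 1.0 - math.exp(-sigma * delta)      # alpha_i
        w = T * alpha                               # rendering weight
        c_hat = [a + w * b for a, b in zip(c_hat, c)]
        d_hat += w * t
        accum += w
        T *= math.exp(-sigma * delta)               # T_{i+1} = T_i (1 - alpha_i)
    # the unoccupied residual (1 - accum) falls onto the sky / far depth
    c_hat = [a + (1.0 - accum) * s for a, s in zip(c_hat, sky_color)]
    d_hat += (1.0 - accum) * far
    return c_hat, d_hat, accum
```

For example, a ray with zero density everywhere returns exactly the sky color and the upper-bound depth.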
For the semantic output, a one-hot vector is assigned to every object:
$$\mathbf{s}_k^{\mathrm{obj}\text{-}j}[l]=\begin{cases}\sigma_k^{\mathrm{obj}\text{-}j} & \text{if } l=\text{category of } j\text{'s instance}\\ 0 & \text{otherwise}\end{cases}\qquad\text{for each category } l.$$
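This per-object one-hot assignment amounts to placing the sample's density in the slot of the object's category (a tiny illustrative helper; the function name is hypothetical):

```python
def object_semantic_logits(sigma_k, category_index, num_categories):
    # One-hot semantic "logits" for a foreground sample of object j:
    # the density sigma_k goes into the slot of j's category, zeros
    # elsewhere. Illustrative sketch, not the paper's code.
    s = [0.0] * num_categories
    s[category_index] = sigma_k
    return s
```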
3.4 Towards Realistic Rendering
- Sky Model

However, blending the sky color $c_{\mathrm{sky}}$ with the background and foreground rendering (Eq. 4) leads to potential inconsistency. Therefore, a BCE (binary cross-entropy) semantic regularization is introduced to alleviate this issue:

$$\mathcal{L}_{\mathrm{sky}}=\mathrm{BCE}(1-\mathrm{accum},\,\mathcal{S}_{\mathrm{sky}}).$$
- Resolving Conflict Samples

Because background and foreground sampling are done independently, background samples may fall within a foreground bounding box, blurring the classification between foreground and background samples.
This ambiguity is not observed in NSG [21], since NSG only samples a few points at the ray-plane intersections and is therefore unlikely to have many background truncated samples.
To address this issue, a regularization term is devised that minimizes the density sum of the background truncated samples, suppressing their influence during rendering:

$$\mathcal{L}_{\mathrm{accum}}=\sum_{P_i^{(\mathrm{tr})}}\sigma_i,$$
where $\{P_i^{(\mathrm{tr})}\}$ denotes the background truncated samples.
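The regularizer is just a masked density sum over the samples flagged as truncated (an illustrative one-liner; the name and mask representation are assumptions):

```python
def accum_reg_loss(sigmas, is_truncated):
    # Sum the densities of background samples that fall inside a
    # foreground bounding box ("truncated" samples), pushing them
    # toward zero. Illustrative sketch, not the paper's code.
    return sum(s for s, tr in zip(sigmas, is_truncated) if tr)
```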
3.5 Optimization
$$\mathcal{L}=\lambda_1\mathcal{L}_{\mathrm{color}}+\lambda_2\mathcal{L}_{\mathrm{depth}}+\lambda_3\mathcal{L}_{\mathrm{sem}}+\lambda_4\mathcal{L}_{\mathrm{sky}}+\lambda_5\mathcal{L}_{\mathrm{accum}},$$
where $\mathcal{L}_{\mathrm{depth}}$ is from Depth-supervised NeRF and $\mathcal{L}_{\mathrm{sem}}$ is from Semantic-NeRF.
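The total objective is a plain weighted sum of the five terms; the $\lambda$ weights are hyperparameters (a trivial sketch with illustrative keys and weight values, not the paper's settings):

```python
def total_loss(losses, weights):
    # Weighted sum of the individual loss terms, e.g.
    # losses = {"color": ..., "depth": ..., "sem": ..., "sky": ..., "accum": ...}
    # with matching lambda weights. Values here are illustrative only.
    return sum(weights[name] * value for name, value in losses.items())
```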
4. Experiment
- Photorealistic Rendering — Dataset: KITTI + VKITTI
- Instance-wise Editing
- The blessing of modular design
- Ablation Results
Unlike prior works [26 SUDS, 15 Panoptic Neural Fields, 21 NSG] that evaluate on a short sequence of 90 images, all evaluations here use the full sequence from the dataset.