【读论文】MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving

1. What

An autonomous driving simulator based on NeRF with three features: instance-aware, modular, and realistic.

2. Why

For current autonomous driving algorithms, training on corner cases is the key to breaking through their performance bottleneck, and simulation is a practical way to generate such cases.

Existing autonomous driving simulation methods, such as CARLA, AADS, and GeoSim, each have their own limitations.

Recently, Neural Scene Graph (NSG) decomposes dynamic scenes into learned scene graphs and learns latent representations for category-level objects. However, its multi-plane-based representation for background modeling cannot synthesize images under large viewpoint changes.

3. How


3.1 Inputs

The input to the system consists of a set of RGB images $\{\mathcal{I}_i\}^N$, sensor poses $\{\mathcal{T}_i\}^N$ (computed from IMU/GPS signals), and object tracklets, including 3D bounding boxes $\{\mathcal{B}_{ij}\}^{N\times M}$, categories $\{\mathrm{type}_{ij}\}^{N\times M}$, and instance IDs $\{\mathrm{idx}_{ij}\}^{N\times M}$. $N$ is the number of input frames and $M$ is the number of tracked instances $\{\mathcal{O}_j\}^M$ across the whole sequence. Optionally, a set of depth maps $\{\mathcal{D}_i\}^N$ and semantic segmentation masks can also be supplied.
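To make the per-frame inputs concrete, here is a minimal sketch of how one frame's data could be bundled. The class and field names are illustrative, not from the MARS codebase, and the box parameterization (center, size, yaw) is one common convention, not necessarily the paper's.

```python
import numpy as np
from dataclasses import dataclass

# Illustrative container for one frame's inputs (names are assumptions).
@dataclass
class FrameInputs:
    image: np.ndarray            # RGB image I_i, shape (H, W, 3)
    sensor_pose: np.ndarray      # camera-to-world pose T_i, shape (4, 4)
    boxes: np.ndarray            # 3D boxes B_ij, shape (M, 7): x, y, z, l, w, h, yaw
    types: list                  # category label type_ij per instance
    instance_ids: list           # instance id idx_ij per instance
    depth: np.ndarray = None     # optional depth map D_i, shape (H, W)
    semantics: np.ndarray = None # optional semantic mask, shape (H, W)

# One frame with M = 2 tracked instances (KITTI-like image resolution).
frame = FrameInputs(
    image=np.zeros((375, 1242, 3), dtype=np.uint8),
    sensor_pose=np.eye(4),
    boxes=np.zeros((2, 7)),
    types=["car", "pedestrian"],
    instance_ids=[0, 1],
)
print(frame.boxes.shape)  # (2, 7)
```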

3.2 Scene Representation

Architectures: It supports various NeRF backbones, which can be roughly categorized into two hyper-classes: MLP-based methods and grid-based methods; the paper gives a formal exposition of grid-based methods (e.g., Instant-NGP).

Foreground Nodes: Similar to NSG, it exploits latent codes to encode instance features and shared category-level decoders to encode class-wise priors.
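The idea above can be sketched as follows: each instance stores only a small latent code, while all instances of one category share a single decoder mapping (position, latent) to (density, color). The tiny two-layer MLP, its sizes, and activations below are stand-ins for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, hidden = 8, 16
# Instance id -> learned latent code (here: random placeholders).
latent_codes = {0: rng.normal(size=latent_dim),
                1: rng.normal(size=latent_dim)}

# One decoder shared by every "car" instance (illustrative weights).
W1 = rng.normal(size=(3 + latent_dim, hidden))
W2 = rng.normal(size=(hidden, 4))  # outputs: density + RGB

def car_decoder(x, z):
    # Concatenate 3D position with the instance latent, then decode.
    h = np.maximum(np.concatenate([x, z]) @ W1, 0.0)  # ReLU
    out = h @ W2
    sigma = np.log1p(np.exp(out[0]))                  # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))              # sigmoid keeps color in (0, 1)
    return sigma, rgb

# Same decoder, different latents -> different instances of the same class.
sigma, rgb = car_decoder(np.array([0.1, 0.2, 0.3]), latent_codes[0])
print(sigma >= 0, rgb.shape)  # True (3,)
```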

3.3 Compositional Rendering

It uses the standard volume rendering process to render pixel-wise properties:

$$
\hat{\mathbf{c}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{c}_i+(1-\mathrm{accum})\cdot\mathbf{c}_{\mathrm{sky}},\qquad T_i=\exp\!\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)
$$
$$
\hat{d}(\mathbf{r})=\sum_{P_i}T_i\alpha_i t_i+(1-\mathrm{accum})\cdot\mathrm{inf}
$$
$$
\hat{\mathbf{s}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{s}_i+(1-\mathrm{accum})\cdot\mathbf{s}_{\mathrm{sky}}.
$$

where $P_i\in\mathrm{sorted}(\{P_i^{\mathrm{bg,\,obj}}\})$, $\alpha_i=1-\exp(-\sigma_i\delta_i)$, $\delta_i=t_{i+1}-t_i$, $\mathrm{accum}=\sum_{P_i}T_i\alpha_i$, $\mathbf{c}_{\mathrm{sky}}$ is the rendered color from the sky model, $\mathrm{inf}$ is the upper-bound distance, and $\mathbf{s}_{\mathrm{sky}}$ is the one-hot semantic logits of the sky category.
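The compositional rendering along one ray can be sketched directly from these definitions: accumulate transmittance-weighted samples, then blend the leftover opacity with the sky term. All numeric values below are made up for illustration.

```python
import numpy as np

def composite(sigmas, deltas, colors, t_vals, c_sky, far):
    """Volume-render one ray of merged (sorted) background+object samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # alpha_i
    # T_i = exp(-sum_{k<i} sigma_k * delta_k): transmittance before sample i.
    trans = np.exp(-np.cumsum(np.concatenate([[0.0], sigmas[:-1] * deltas[:-1]])))
    weights = trans * alphas                                # T_i * alpha_i
    accum = weights.sum()
    c_hat = weights @ colors + (1.0 - accum) * c_sky        # color with sky blend
    d_hat = weights @ t_vals + (1.0 - accum) * far          # depth with "inf" term
    return c_hat, d_hat, accum

# Three samples along a toy ray.
sigmas = np.array([0.5, 1.0, 2.0])
deltas = np.array([0.1, 0.1, 0.1])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
t_vals = np.array([1.0, 1.1, 1.2])
c_hat, d_hat, accum = composite(sigmas, deltas, colors, t_vals,
                                c_sky=np.ones(3), far=100.0)
print(0.0 <= accum <= 1.0)  # True
```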

For semantic segmentation, a one-hot semantic vector is assigned to every object sample:

$$
\mathbf{s}_k^{\mathrm{obj}\text{-}j}[l]=\begin{cases}\sigma_k^{\mathrm{obj}\text{-}j} & \text{if } l=\text{category of } j\text{'s instance}\\ 0 & \text{otherwise}\end{cases},\quad\text{for } l \text{ in categories}.
$$
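In code, this amounts to placing the sample's density at its instance's category channel and zeros elsewhere; the category list below is an illustrative assumption.

```python
import numpy as np

# Hypothetical category list for illustration.
categories = ["car", "pedestrian", "sky"]

def object_semantic_logits(sigma_k, category):
    """One-hot semantic logits: density only at the instance's category channel."""
    s = np.zeros(len(categories))
    s[categories.index(category)] = sigma_k
    return s

print(object_semantic_logits(2.5, "car"))
```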

3.4 Towards Realistic Rendering

  1. Sky Model

    However, blending the sky color $\mathbf{c}_{\mathrm{sky}}$ with the background and foreground rendering (Eq. 4) leads to potential inconsistency. Therefore, a BCE (binary cross-entropy) semantic regularization is introduced to alleviate this issue:

    $$\mathcal{L}_{\mathrm{sky}}=\mathrm{BCE}(1-\mathrm{accum},\,\mathcal{S}_{\mathrm{sky}}).$$
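This loss pushes the per-ray accumulated opacity toward 0 on pixels the segmentation labels as sky, and toward 1 elsewhere. A minimal sketch, with BCE written out by hand and illustrative values:

```python
import numpy as np

def sky_loss(accum, sky_mask, eps=1e-6):
    """BCE between per-ray 'sky-ness' (1 - accum) and the sky mask S_sky."""
    p = np.clip(1.0 - accum, eps, 1.0 - eps)   # predicted probability of sky
    return -np.mean(sky_mask * np.log(p) + (1.0 - sky_mask) * np.log(1.0 - p))

accum = np.array([0.05, 0.99])   # ray 0 nearly empty, ray 1 fully occupied
sky_mask = np.array([1.0, 0.0])  # ray 0 labelled sky, ray 1 labelled non-sky
print(sky_loss(accum, sky_mask) < 0.1)  # True: predictions agree with labels
```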

  2. Resolving Conflict Samples:

    Because background and foreground sampling are done independently, background samples may fall within a foreground bounding box, so foreground content can be incorrectly explained by background samples.


    This ambiguity is NOT observed in NSG [21], as NSG only samples a few points at the ray-plane intersections and is therefore unlikely to produce many background truncated samples.

    To address this issue, we devise a regularization term that minimizes the density sum of background truncated samples, suppressing their influence during the rendering process:

    $$\mathcal{L}_{\mathrm{accum}}=\sum_{P_i^{(\mathrm{tr})}}\sigma_i,$$

    where $\{P_i^{(\mathrm{tr})}\}$ denotes the background truncated samples.
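A sketch of this regularizer: find background samples that land inside a foreground box and sum their densities. For simplicity, the points here are assumed to already be in the box's local frame, so the inside test is just an axis-aligned extent check; a real implementation would first transform world points by the box pose.

```python
import numpy as np

def accum_loss(bg_points, bg_sigmas, box_half_extents):
    """Sum the densities of background samples falling inside one foreground box.

    bg_points are assumed to be in box-local coordinates (simplification)."""
    inside = np.all(np.abs(bg_points) <= box_half_extents, axis=-1)
    return bg_sigmas[inside].sum()

points = np.array([[0.1, 0.0, 0.2],    # inside the box -> truncated sample
                   [5.0, 0.0, 0.0]])   # well outside the box
sigmas = np.array([3.0, 7.0])
print(accum_loss(points, sigmas, np.array([1.0, 1.0, 1.0])))  # 3.0
```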

3.5 Optimization

$$\mathcal{L}=\lambda_1\mathcal{L}_{\mathrm{color}}+\lambda_2\mathcal{L}_{\mathrm{depth}}+\lambda_3\mathcal{L}_{\mathrm{sem}}+\lambda_4\mathcal{L}_{\mathrm{sky}}+\lambda_5\mathcal{L}_{\mathrm{accum}},$$

where $\mathcal{L}_{\mathrm{depth}}$ is from Depth-supervised NeRF and $\mathcal{L}_{\mathrm{sem}}$ is from Semantic-NeRF.
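Combining the five terms is a plain weighted sum; the lambda values below are placeholders for illustration, not the paper's hyperparameters.

```python
# Sketch of the full objective as a weighted sum of the five loss terms.
def total_loss(losses, weights):
    return sum(weights[name] * value for name, value in losses.items())

# Illustrative loss values and weights (not the paper's settings).
losses = {"color": 0.02, "depth": 0.10, "sem": 0.05, "sky": 0.03, "accum": 0.40}
weights = {"color": 1.0, "depth": 0.1, "sem": 0.1, "sky": 0.01, "accum": 0.01}
print(round(total_loss(losses, weights), 6))  # 0.0393
```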

4. Experiment

  1. Photorealistic Rendering

    Dataset: KITTI+VKITTI

  2. Instance-wise Editing

  3. The blessing of modular design

  4. Ablation Results

    Unlike prior works [26 (SUDS), 15 (Panoptic Neural Fields), 21 (NSG)] that evaluate their method on a short sequence of 90 images, we use the full sequence from the dataset for all evaluations.
