1. What
An autonomous driving simulator based on NeRF with three features: instance-aware, modular, and realistic.
2. Why
For current autonomous driving algorithms, training on corner cases helps push past their performance bottleneck.
Existing autonomous driving simulation methods, such as CARLA, AADS, and GeoSim, each have their own limitations.
Recently, Neural Scene Graph (NSG) decomposes dynamic scenes into learned scene graphs and learns latent representations for category-level objects. However, its multi-plane-based representation for background modeling cannot synthesize images under large viewpoint changes.
3. How
3.1 Inputs
The input to the system consists of a set of RGB images $\{\mathcal{I}_i\}^N$, sensor poses $\{\mathcal{T}_i\}^N$ (computed from IMU/GPS signals), and object tracklets, including 3D bounding boxes $\{\mathcal{B}_{ij}\}^{N\times M}$, categories $\{\mathrm{type}_{ij}\}^{N\times M}$, and instance IDs $\{\mathrm{idx}_{ij}\}^{N\times M}$. $N$ is the number of input frames and $M$ is the number of tracked instances $\{\mathcal{O}_j\}^M$ across the whole sequence. A set of depth maps $\{\mathcal{D}_i\}^N$ and semantic segmentation masks can optionally be provided.
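For concreteness, the per-frame inputs above could be grouped in a small container like the following (a minimal sketch; `FrameInput` and all field names are assumptions for illustration, not the paper's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameInput:
    # Hypothetical container for one frame's inputs; field names are
    # illustrative, not taken from the paper's code.
    image: list                       # RGB image I_i
    sensor_pose: list                 # 4x4 pose T_i from IMU/GPS
    boxes: list                       # M 3D bounding boxes B_ij
    types: list                       # M category labels type_ij
    ids: list                         # M instance IDs idx_ij
    depth: Optional[list] = None      # optional depth map D_i
    semantics: Optional[list] = None  # optional segmentation mask
```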
3.2 Scene Representation
Architectures: It supports various NeRF backbones, which can be roughly categorized into two classes: MLP-based methods and grid-based methods. This paper gives a formal exposition of grid-based methods (Instant-NGP).
Foreground Nodes: Similar to NSG, it exploits latent codes to encode instance features and shared category-level decoders to encode class-wise priors.
3.3 Compositional Rendering
It uses the standard volume rendering process to render pixel-wise properties:
$$\hat{\mathbf{c}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{c}_i+(1-\mathrm{accum})\cdot\mathbf{c}_{\mathrm{sky}},\qquad T_i=\exp\Big(-\sum_{k=1}^{i-1}\sigma_k\delta_k\Big)$$

$$\hat{d}(\mathbf{r})=\sum_{P_i}T_i\alpha_i t_i+(1-\mathrm{accum})\cdot\mathrm{inf}$$

$$\hat{\mathbf{s}}(\mathbf{r})=\sum_{P_i}T_i\alpha_i\mathbf{s}_i+(1-\mathrm{accum})\cdot\mathbf{s}_{\mathrm{sky}}.$$
where $P_i\in\mathrm{sorted}(\{P_i^{\mathrm{bg,obj}}\})$, $\alpha_i=1-\exp(-\sigma_i\delta_i)$, $\delta_i=t_{i+1}-t_i$, $\mathrm{accum}=\sum_{P_i}T_i\alpha_i$, $\mathbf{c}_{\mathrm{sky}}$ is the rendered color from the sky model, $\mathrm{inf}$ is the upper-bound distance, and $\mathbf{s}_{\mathrm{sky}}$ is the one-hot semantic logits of the sky category.
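The compositing above can be sketched for a single ray in pure Python (a minimal illustration; `composite_render` and its argument names are assumptions, and the semantic channel is omitted for brevity):

```python
import math

def composite_render(sigmas, deltas, colors, t_vals, sky_color, far):
    """Front-to-back alpha compositing over the sorted (bg + obj) samples
    of one ray, blending the residual transmittance with a sky color and
    an upper-bound depth. Illustrative sketch, not the paper's code."""
    T = 1.0                      # accumulated transmittance T_i
    accum = 0.0                  # sum of T_i * alpha_i (ray opacity)
    c_hat = [0.0, 0.0, 0.0]      # accumulated color
    d_hat = 0.0                  # accumulated depth
    for sigma, delta, c, t in zip(sigmas, deltas, colors, t_vals):
        alpha = 1.0 - math.exp(-sigma * delta)      # alpha_i
        w = T * alpha                               # rendering weight
        c_hat = [a + w * b for a, b in zip(c_hat, c)]
        d_hat += w * t
        accum += w
        T *= math.exp(-sigma * delta)               # T_{i+1} = T_i (1 - alpha_i)
    # the unoccupied residual (1 - accum) falls onto the sky / far depth
    c_hat = [a + (1.0 - accum) * s for a, s in zip(c_hat, sky_color)]
    d_hat += (1.0 - accum) * far
    return c_hat, d_hat, accum
```

For example, a ray with zero density everywhere returns exactly the sky color and the upper-bound depth.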
For the semantic output, a one-hot vector is assigned to every object:
$$\mathbf{s}_k^{\mathrm{obj}\text{-}j}[l]=\begin{cases}\sigma_k^{\mathrm{obj}\text{-}j} & \text{if } l=\text{category of } j\text{'s instance}\\ 0 & \text{otherwise}\end{cases}\qquad\text{for each category } l.$$
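This per-object one-hot assignment amounts to placing the sample's density in the slot of the object's category (a tiny illustrative helper; the function name is hypothetical):

```python
def object_semantic_logits(sigma_k, category_index, num_categories):
    # One-hot semantic "logits" for a foreground sample of object j:
    # the density sigma_k goes into the slot of j's category, zeros
    # elsewhere. Illustrative sketch, not the paper's code.
    s = [0.0] * num_categories
    s[category_index] = sigma_k
    return s
```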
3.4 Towards Realistic Rendering
- Sky Model

However, blending the sky color $c_{\mathrm{sky}}$ with the background and foreground rendering (Eq. 4) leads to potential inconsistency. Therefore, a BCE (binary cross-entropy) semantic regularization is introduced to alleviate this issue:

$$\mathcal{L}_{\mathrm{sky}}=\mathrm{BCE}(1-\mathrm{accum},\,\mathcal{S}_{\mathrm{sky}}).$$
- Resolving Conflict Samples

Because background and foreground sampling are done independently, background samples may fall within a foreground bounding box, blurring the classification between foreground and background samples.
This ambiguity is not observed in NSG [21], since NSG only samples a few points at the ray-plane intersections and is therefore unlikely to have many background truncated samples.
To address this issue, a regularization term is devised that minimizes the density sum of the background truncated samples, suppressing their influence during rendering:

$$\mathcal{L}_{\mathrm{accum}}=\sum_{P_i^{(\mathrm{tr})}}\sigma_i,$$
where $\{P_i^{(\mathrm{tr})}\}$ denotes the background truncated samples.
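The regularizer is just a masked density sum over the samples flagged as truncated (an illustrative one-liner; the name and mask representation are assumptions):

```python
def accum_reg_loss(sigmas, is_truncated):
    # Sum the densities of background samples that fall inside a
    # foreground bounding box ("truncated" samples), pushing them
    # toward zero. Illustrative sketch, not the paper's code.
    return sum(s for s, tr in zip(sigmas, is_truncated) if tr)
```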
3.5 Optimization
$$\mathcal{L}=\lambda_1\mathcal{L}_{\mathrm{color}}+\lambda_2\mathcal{L}_{\mathrm{depth}}+\lambda_3\mathcal{L}_{\mathrm{sem}}+\lambda_4\mathcal{L}_{\mathrm{sky}}+\lambda_5\mathcal{L}_{\mathrm{accum}},$$
where $\mathcal{L}_{\mathrm{depth}}$ is from Depth-supervised NeRF and $\mathcal{L}_{\mathrm{sem}}$ is from Semantic-NeRF.
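The total objective is a plain weighted sum of the five terms; the $\lambda$ weights are hyperparameters (a trivial sketch with illustrative keys and weight values, not the paper's settings):

```python
def total_loss(losses, weights):
    # Weighted sum of the individual loss terms, e.g.
    # losses = {"color": ..., "depth": ..., "sem": ..., "sky": ..., "accum": ...}
    # with matching lambda weights. Values here are illustrative only.
    return sum(weights[name] * value for name, value in losses.items())
```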
4. Experiment
- Photorealistic Rendering — Dataset: KITTI + VKITTI
- Instance-wise Editing
- The blessing of modular design
- Ablation Results
Unlike prior works [26 SUDS, 15 Panoptic Neural Fields, 21 NSG] that evaluate on a short sequence of 90 images, all evaluations here use the full sequence from the dataset.