Supplemental Material - 补充材料 - MonoGRNet



Anonymous AAAI submission
Paper ID 3014


arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this supplemental material, we provide additional technical details and extra analysis experiments for the main paper.


Network Architecture

We choose VGG16 (Matthew and Rob 2014) as our CNN backbone. For the 2D detector, we mainly follow KittiBox (Teichmann et al. 2016). The 6 output channels consist of the objectness confidence (2 channels), the offsets of the 2D bounding-box center from the grid-cell center (2 channels), and the size of the bounding box (2 channels). In the refinement stage, RoIAlign is applied to the region of interest in the early feature maps to regress the delta values.
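The 6-channel layout above can be sketched as follows. This is a minimal decoding example, not the authors' code: the exact channel ordering, the grid-cell size, and the offset convention are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_2d_output(out, cell_size=32.0):
    """out: (H, W, 6) raw detector output on the prediction grid."""
    obj = softmax(out[..., 0:2])[..., 1]   # per-cell objectness probability
    dx, dy = out[..., 2], out[..., 3]      # center offsets from the grid-cell center
    w, h = out[..., 4], out[..., 5]        # bounding-box width and height
    H, W = out.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    cx = (xs + 0.5) * cell_size + dx       # box center in image coordinates
    cy = (ys + 0.5) * cell_size + dy
    return obj, np.stack([cx, cy, w, h], axis=-1)

conf, boxes = decode_2d_output(np.zeros((12, 39, 6)))
print(conf.shape, boxes.shape)  # (12, 39) (12, 39, 4)
```

With zero logits the two objectness channels tie, so every cell decodes to confidence 0.5 and a box centered on its grid cell.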

The sub-networks for instance-level depth estimation, 3D localization, and corner-offset regression share the same buffer zones following the backbone network. Note that there are two buffer zones, one extending from conv4_3 and another from pool5. Layers with 64 output channels are designed as bottlenecks that force the network to encode minimal sufficient information, which prevents over-fitting and also reduces the computational cost. Inverted residual connections (Sandler et al. 2018) are applied between neighboring bottlenecks to provide shortcuts for gradient propagation.
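A toy numpy sketch of the bottleneck-plus-shortcut idea, assuming a 64-channel bottleneck as in the text and a hypothetical 256-channel expansion; the real MobileNetV2 block also inserts a depthwise 3×3 convolution between the expansion and projection, which is omitted here for brevity.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is per-pixel channel mixing: x is (H, W, Cin), w is (Cin, Cout).
    return x @ w

def inverted_residual(x, w_expand, w_project):
    # Expand the narrow 64-channel bottleneck, apply a nonlinearity,
    # project back down, then add the identity shortcut that links
    # neighboring bottlenecks and carries the gradient.
    h = np.maximum(conv1x1(x, w_expand), 0.0)   # 64 -> 256, ReLU
    h = conv1x1(h, w_project)                   # 256 -> 64, linear projection
    return x + h                                # shortcut: channel counts must match

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 64))
y = inverted_residual(x,
                      0.01 * rng.standard_normal((64, 256)),
                      0.01 * rng.standard_normal((256, 64)))
print(y.shape)  # (8, 8, 64)
```

The shortcut is only well-defined because both ends of the block have the same 64-channel width, which is why it connects neighboring bottlenecks rather than the wide expansion layers.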

The branch (i.e., the depth encoder) for instance-level depth estimation follows the architecture introduced in DORN (Fu et al. 2018): it stacks along the channel axis the outputs of a fully connected global information encoder and 3 parallel dilated convolution layers (Yu and Koltun 2015) with different dilation rates, which are 6, 12, 24 for the early features and 2, 4, 8 for the deep features. The branches for location estimation and corner regression each contain 4 convolution layers with 3 × 3 kernels and stride 1. While there are 96 weighted layers in total, the deepest path, i.e., from the input to the IDE output, contains only 26 weighted layers, since the 3D reasoning branches are parallel. The detailed network configuration is shown in Table A.
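A shape-level sketch of this DORN-style fusion, using stand-in arrays instead of real convolutions; the 32-channel branch width is an assumption for illustration. It also shows why the dilation rates matter: a 3×3 kernel with dilation d covers a (2d + 1)-pixel window per axis, so rates 6, 12, 24 give the parallel branches very different receptive fields.

```python
import numpy as np

H, W = 8, 8

def span_3x3(d):
    # A 3x3 kernel with dilation d covers a (2*d + 1)-pixel window per axis.
    return 2 * d + 1

# Stand-ins for the three parallel dilated 3x3 conv outputs on the early
# features (rates 6, 12, 24); each branch is assumed to emit 32 channels.
branches = [np.zeros((H, W, 32)) for _ in (6, 12, 24)]

# Stand-in for the fully connected global information encoder: a single
# feature vector tiled back over the spatial grid.
global_feat = np.broadcast_to(np.zeros((1, 1, 32)), (H, W, 32))

# DORN-style fusion: stack everything along the channel axis.
fused = np.concatenate(branches + [global_feat], axis=-1)
print(fused.shape, [span_3x3(d) for d in (6, 12, 24)])  # (8, 8, 128) [13, 25, 49]
```

The same fusion applies to the deep features with rates 2, 4, 8, whose smaller spans (5, 9, 17) match the coarser resolution of those feature maps.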

Table A: Network configuration.

Results Visualization

In Fig. A, we compare our 3D detection results with 3DOP (Chen et al. 2015) and Mono3D (Chen et al. 2016) by visualizing them in 3D space and on the image. We also present the instance-level depth estimation outputs in Fig. B.

Figure A: Qualitative comparison. Blue boxes indicate ground truths and orange ones are predictions. It can be seen from (a) that our method is the most stable when dealing with far objects. In corner cases where the object is truncated by the image boundaries, as in (d), our method can still localize the whole ABBox-3D.

Figure B: Instance-level depth. Each grid cell predicts the 3D centric depth of its nearest instance. Cells with objectness confidence (provided by the 2D detector) no less than 0.1 are kept for visualization.
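The visualization rule in the caption amounts to a simple per-cell mask. A minimal sketch, with made-up confidence and depth values purely for illustration:

```python
import numpy as np

# Per-cell objectness from the 2D detector and the corresponding
# instance-level depth predictions (values are made up for illustration).
conf = np.array([[0.02, 0.35],
                 [0.80, 0.05]])
depth = np.array([[41.0, 12.5],
                  [ 7.3, 55.0]])

# Keep cells whose confidence is no less than 0.1; mask out the rest.
vis = np.where(conf >= 0.1, depth, np.nan)
print(vis)
```

Cells below the threshold become NaN and are simply left blank when the depth map is rendered.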

References

Zeiler, M. D., and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks.
Teichmann, M.; Weber, M.; Zöllner, M.; Cipolla, R.; and Urtasun, R. 2016. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks.
