NeRF: Implicit Scene Representation as Neural Radiance Fields for View Synthesis

Original article, last modified 2022-12-07 16:33:25

License: CC 4.0 BY-SA

First published 2022-11-09 16:50:01

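This article describes how NeRF uses an MLP to represent a scene implicitly, covering its positional encoding, its volume rendering method, and its handling of camera parameters. Hierarchical sampling improves efficiency and reduces redundant computation. Key concepts include the sample position, direction encoding, density prediction, and color generation.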

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Pipeline

[Figure: NeRF pipeline]

NeRF uses an MLP to learn an implicit representation of the scene. Its hypothesis function is defined as $H:(\mathbf{x},\mathbf{d})\to(\mathbf{c},\sigma)$, where $\mathbf{x}=(x,y,z)$ is the (normalized) sample position, $\mathbf{d}=(\theta,\phi)$ is the viewing direction, $\mathbf{c}=(r,g,b)$ is the emitted color, and $\sigma$ is the volume density.

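Two angles $\theta$ and $\phi$ are enough to describe a direction, because in spherical coordinates:
$\begin{cases} x=r\sin\theta\cos\phi\\ y=r\sin\theta\sin\phi\\ z=r\cos\theta \end{cases}$
Fixing $r\equiv1$ for directions, $\mathbf{d}$ needs only $2$ degrees of freedom to represent any direction.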
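
As an illustration of the spherical parameterization above, the following minimal sketch converts between the two angles and a unit direction vector. The function names are hypothetical and chosen only for this example.

```python
import math

def direction_from_angles(theta, phi):
    """Unit direction for polar angle theta and azimuth phi (radians), with r = 1."""
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )

def angles_from_direction(d):
    """Recover (theta, phi) from a non-zero direction vector."""
    x, y, z = d
    r = math.sqrt(x * x + y * y + z * z)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, z / r))
    return math.acos(cos_theta), math.atan2(y, x)
```

The round trip only recovers the original angles when $\theta\in[0,\pi]$ and $\phi\in(-\pi,\pi]$, which is the range returned by `math.acos` and `math.atan2`.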

MLP

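The authors rely on the finding of Rahaman et al. that neural networks are biased toward learning low-frequency signals. To improve a network's ability to fit high-frequency content, the input can be mapped to a higher-dimensional space with high-frequency functions. NeRF follows this idea: the authors apply a positional encoding function $\gamma:\mathbb{R}\to\mathbb{R}^{2L}$ to the MLP inputs, where $L$ sets the number of frequency bands:
$\gamma(p)= \begin{pmatrix} \sin{2^0\pi p}\\ \cos{2^0\pi p}\\ \sin{2^1\pi p}\\ \cos{2^1\pi p}\\ \cdots\\ \sin{2^{L-1}\pi p}\\ \cos{2^{L-1}\pi p}\\ \end{pmatrix}$
Experiments show that $L = 10$ for $\mathbf{x}$ and $L = 4$ for $\mathbf{d}$ give comparatively good results.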
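
The encoding can be sketched in a few lines of plain Python. This is a minimal illustration of the formula for $\gamma$, applied to each coordinate separately and concatenated; the helper names are assumptions made for this example, not taken from any NeRF codebase.

```python
import math

def positional_encoding(p, num_freqs):
    """gamma(p) for a scalar p: sin and cos at frequencies 2^0 pi, ..., 2^(L-1) pi."""
    out = []
    for k in range(num_freqs):
        angle = (2 ** k) * math.pi * p
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def encode_vector(v, num_freqs):
    """Apply gamma to every component of v and concatenate the results."""
    out = []
    for p in v:
        out.extend(positional_encoding(p, num_freqs))
    return out
```

With a 3-component input, `encode_vector(v, 10)` yields 60 values and `encode_vector(v, 4)` yields 24, i.e. $3\times2\times L$ values in general.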

[Figure: NeRF MLP architecture]

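In the figure above, the positional encodings have dimensions $\dim\gamma(\mathbf{x})=3\times2\times10=60$ and $\dim\gamma(\mathbf{d})=3\times2\times4=24$. The factor of $3$ for $\mathbf{d}$ arises because the direction is fed to the network as a 3D Cartesian unit vector rather than as the two angles $(\theta,\phi)$.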

$\gamma(\mathbf{x})$ passes through $8$ layers $\mathrm{Pos\_Enc}_i$ to produce a $256$-dimensional feature vector. Notably, NeRF adopts the skip connection from the DeepSDF architecture: the output of $\mathrm{Pos\_Enc}_4$ is concatenated with $\gamma(\mathbf{x})$ before entering $\mathrm{Pos\_Enc}_5$.

The $256$-dimensional feature vector passes through one layer $\mathrm{Sigma\_Out}$ to produce the volume density $\sigma$. The same feature vector is also fed into one layer $\mathrm{Pos\_Enc\_Fin}$, concatenated with $\gamma(\mathbf{d})$, and then passed through one layer $\mathrm{Dir\_Enc}$ and one layer $\mathrm{RGB\_Out}$ to produce the emitted color $\mathbf{c}$. The input dimension of $\mathrm{Pos\_Enc}_i$ is $60$ for the first layer, $256+60=316$ for the layer after the skip connection, and $256$ otherwise; the input dimension of $\mathrm{Dir\_Enc}$ is $256+24=280$.
$\begin{aligned} \mathrm{Pos\_Enc}_i(x)&=\mathrm{ReLU}(W^Tx),&W&\in\mathbb{R}^{\cdot\times256}\\ \mathrm{Pos\_Enc\_Fin}(x)&=\mathrm{ReLU}(W^Tx),&W&\in\mathbb{R}^{256\times256}\\ \mathrm{Dir\_Enc}(x)&=\mathrm{ReLU}(W^Tx),&W&\in\mathbb{R}^{280\times128}\\ \mathrm{Sigma\_Out}(x)&=\mathrm{ReLU}(W^Tx),&W&\in\mathbb{R}^{256\times1}\\ \mathrm{RGB\_Out}(x)&=\mathrm{sigmoid}(W^Tx),&W&\in\mathbb{R}^{128\times3} \end{aligned}$
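
To make the data flow and layer shapes concrete, here is a rough pure-Python sketch of the forward pass described above. It follows the layer formulas as written in this article (no biases, ReLU on every hidden layer), uses random untrained weights, and all function and dictionary key names are assumptions made for illustration only.

```python
import math
import random

def make_layer(rng, n_in, n_out):
    """Random weight matrix W with n_in rows and n_out columns (biases omitted)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]

def linear(W, x):
    """Compute W^T x for W stored as n_in rows of length n_out."""
    assert len(W) == len(x), "input size must match the number of rows of W"
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def init_params(rng, dim_x=60, dim_d=24, width=256):
    # Input sizes of Pos_Enc_1..8; Pos_Enc_5 also receives gamma(x) via the skip connection.
    in_dims = [dim_x] + [width] * 3 + [width + dim_x] + [width] * 3
    return {
        "pos_enc": [make_layer(rng, n, width) for n in in_dims],
        "sigma_out": make_layer(rng, width, 1),
        "pos_enc_fin": make_layer(rng, width, width),
        "dir_enc": make_layer(rng, width + dim_d, width // 2),
        "rgb_out": make_layer(rng, width // 2, 3),
    }

def nerf_forward(params, gamma_x, gamma_d):
    """Return (rgb, sigma) for one encoded position and one encoded direction."""
    h = list(gamma_x)
    for i, W in enumerate(params["pos_enc"]):
        if i == 4:  # Pos_Enc_5: concatenate the output of Pos_Enc_4 with gamma(x)
            h = h + list(gamma_x)
        h = relu(linear(W, h))
    sigma = relu(linear(params["sigma_out"], h))[0]
    feat = relu(linear(params["pos_enc_fin"], h))
    hidden = relu(linear(params["dir_enc"], feat + list(gamma_d)))
    rgb = sigmoid(linear(params["rgb_out"], hidden))
    return rgb, sigma
```

Because $\sigma$ is computed before $\gamma(\mathbf{d})$ enters the network, the density depends only on position, while the color depends on both position and viewing direction.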

Volume Rendering

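A camera ray can be parameterized by $t$ as $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$, starting at the origin $\mathbf{o}$ and advancing along the direction $\mathbf{d}$. During rendering, the MLP is queried repeatedly along the ray, $H(\mathbf{r}(t),\mathbf{d})=(\mathbf{c},\sigma)$, to obtain the color and density at each point. Given the near bound $t_n$ and far bound $t_f$ of the view frustum, the ideal rendered color is:
$C(\mathbf{r})=\int_{t_n}^{t_f}{T(t)\sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t),\mathbf{d})\mathrm{d}t}$
where $T(t)$ is the accumulated transmittance, i.e. the probability that the ray travels from $t_n$ to $t$ without being blocked:
$T(t)=\exp\left(-\int_{t_n}^t{\sigma(\mathbf{r}(s))\mathrm{d}s}\right)$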
The integral is discretized with a Riemann sum: $[t_n,t_f]$ is divided into $N$ equal subintervals, and one sample $t_i$ (with $1\le i\le N$) is drawn uniformly from each:
$t_i\sim\mathcal{U}\left[t_n+(i-1)\cdot\frac{t_f-t_n}{N},t_n+i\cdot\frac{t_f-t_n}{N}\right]$
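
A minimal sketch of this stratified sampling step, assuming a hypothetical helper name and a caller-supplied `random.Random` instance for reproducibility:

```python
import random

def stratified_samples(t_near, t_far, n_bins, rng):
    """Draw one uniform sample from each of n_bins equal-width bins of [t_near, t_far]."""
    width = (t_far - t_near) / n_bins
    # The loop index i runs from 0, so it plays the role of (i - 1) in the formula.
    return [t_near + (i + rng.random()) * width for i in range(n_bins)]
```

Unlike a fixed grid, the sample positions change on every call, so over the course of training the network is evaluated at a continuous range of depths.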
When $N$ is large enough, the density and color can be treated as constant within each interval, so the integral can be computed from rectangle areas. Writing the interval length as $\delta_i=t_{i+1}-t_i$, the accumulated transmittance becomes:
$T_i=\exp\left(-\int_{t_1}^{t_i}{\sigma(\mathbf{r}(s))\mathrm{d}s}\right)=\exp\left(-\sum_{j=1}^{i-1}{\sigma_j\delta_j}\right)$
Similarly, the rendered color is discretized as follows:
$\begin{aligned} \hat{C}(\mathbf{r}) &=\sum_{i=1}^{N}{\int_{t_i}^{t_{i+1}}{T(t)\sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t),\mathbf{d})\mathrm{d}t}}\\ &=\sum_{i=1}^{N}{\sigma_i\mathbf{c}_i\int_{t_i}^{t_{i+1}}{\exp\left(-\int_{t_1}^{t_i}{\sigma(\mathbf{r}(s))\mathrm{d}s}-\int_{t_i}^t{\sigma(\mathbf{r}(s))\mathrm{d}s}\right)\mathrm{d}t}}\\ &=\sum_{i=1}^{N}{T_i\sigma_i\mathbf{c}_i\int_{t_i}^{t_{i+1}}{\exp\left(-\sigma_i\int_{t_i}^t{\mathrm{d}s}\right)\mathrm{d}t}}\\ &=\sum_{i=1}^{N}{T_i\sigma_i\mathbf{c}_i\int_{t_i}^{t_{i+1}}{e^{\displaystyle{-\sigma_i(t-t_i)}}\mathrm{d}t}}\\ &=\sum_{i=1}^{N}{T_i\sigma_i\mathbf{c}_i\cdot\left.\frac{e^{\displaystyle{-\sigma_i(t-t_i)}}}{-\sigma_i}\right|_{t_i}^{t_{i+1}}}\\ &=-\sum_{i=1}^{N}{T_i\left(e^{\displaystyle{-\sigma_i(t_{i+1}-t_i)}}-1\right)\mathbf{c}_i}\\ &=\sum_{i=1}^{N}{T_i\left(1-e^{\displaystyle{-\sigma_i\delta_i}}\right)\mathbf{c}_i}\\ \end{aligned}$
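
The final sum can be evaluated with a single pass along the ray. The sketch below is one possible plain-Python rendition; the function name and the choice to return the per-sample weights alongside the color are assumptions made for this example.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    Returns (rendered_color, weights), where
    weights[i] = T_i * (1 - exp(-sigma_i * delta_i)).
    """
    rendered = [0.0, 0.0, 0.0]
    weights = []
    optical_depth = 0.0  # running sum of sigma_j * delta_j over earlier samples
    for sigma, color, delta in zip(sigmas, colors, deltas):
        transmittance = math.exp(-optical_depth)  # T_i
        alpha = 1.0 - math.exp(-sigma * delta)    # opacity of segment i
        w = transmittance * alpha
        weights.append(w)
        for k in range(3):
            rendered[k] += w * color[k]
        optical_depth += sigma * delta
    return rendered, weights
```

The weights telescope: their sum equals $1-\exp\left(-\sum_i\sigma_i\delta_i\right)$, so it never exceeds $1$, and a sample behind a very dense segment receives almost no weight.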

Hierarchical volume sampling

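To make rendering more efficient by avoiding samples in free space and occluded regions, which contribute nothing to the rendered image, NeRF uses hierarchical sampling at two levels of granularity, with a coarse MLP and a fine MLP.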

In the coarse pass, $N_c$ points are sampled. Writing the weight of $\mathbf{c}_i$ as $w_i=T_i\left(1-e^{-\sigma_i\delta_i}\right)$:
$\hat{C}_c(\mathbf{r})=\sum_{i=1}^{N_c}{w_i\mathbf{c}_i}$
The normalized weights $\hat{w}_i=\cfrac{w_i}{\displaystyle\sum_{j=1}^{N_c}{w_j}}\in[0,1]$ can be interpreted as a piecewise-constant probability density function (PDF) along the ray. A further $N_f$ points are drawn from it by inverse transform sampling, and all $N_c+N_f$ points are used for the fine pass, with weights and colors evaluated by the fine network:
$\hat{C}_f(\mathbf{r})=\sum_{i=1}^{N_c+N_f}{w_i\mathbf{c}_i}$
These $N_f$ points fall mostly in intervals where $\hat{w}_i$ is large, i.e. in the regions that contribute most to the color, which improves NeRF's rendering of fine detail.
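
Inverse transform sampling from a piecewise-constant PDF can be sketched as follows. This is an illustrative version under stated assumptions: `bin_edges` is a hypothetical list of interval boundaries along the ray (one more entry than `weights`), and the uniform fallback for all-zero weights is a choice made for this example rather than something specified by the paper.

```python
import bisect
import random

def sample_pdf(bin_edges, weights, n_samples, rng):
    """Draw n_samples depths from the piecewise-constant PDF defined by weights.

    Bin i spans [bin_edges[i], bin_edges[i + 1]] and has probability
    weights[i] / sum(weights).
    """
    total = sum(weights)
    if total <= 0.0:
        # Assumption: with no information, fall back to a uniform PDF.
        weights = [1.0] * len(weights)
        total = float(len(weights))
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(n_samples):
        u = rng.random()
        # Locate the bin i with cdf[i] <= u < cdf[i + 1]; clamp against rounding.
        i = min(bisect.bisect_right(cdf, u) - 1, len(weights) - 1)
        span = cdf[i + 1] - cdf[i]
        frac = (u - cdf[i]) / span if span > 0.0 else 0.0
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)
```

Bins with zero weight occupy no width in the CDF, so no uniform draw can land in them; this is what steers the fine samples away from empty space.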

Loss function

The loss is the sum of the squared L2 errors of the coarse and fine renderings:
$\mathcal{L}=\sum_{\mathbf{r}\in\mathcal{R}}\left[\left\|\hat{C}_c(\mathbf{r})-C(\mathbf{r})\right\|^2_2+\left\|\hat{C}_f(\mathbf{r})-C(\mathbf{r})\right\|^2_2\right]$
where $\mathcal{R}$ is the set of rays sampled in a batch.
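
As a small worked example, the loss can be written directly from the formula. The function name is hypothetical, and each argument is assumed to be a list of RGB triples, one per ray in the batch.

```python
def nerf_loss(coarse_rgb, fine_rgb, target_rgb):
    """Sum over rays of the squared L2 errors of the coarse and fine renderings."""
    def squared_error(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return sum(
        squared_error(c, t) + squared_error(f, t)
        for c, f, t in zip(coarse_rgb, fine_rgb, target_rgb)
    )
```

The coarse term is kept in the loss even though only the fine rendering is shown to the user, because the coarse network's weights decide where the fine samples are placed.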

Camera intrinsics & extrinsics

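The camera's extrinsic matrix transforms world coordinates into camera coordinates, and the intrinsic matrix transforms camera coordinates into pixel coordinates.

Given world coordinates $\mathbf{P}_w=(X_w,Y_w,Z_w)$, the camera coordinate system $\mathbf{P}_c=(X_c,Y_c,Z_c)$ usually takes the camera's position in the world as its origin and the camera's viewing direction as its $Z$ axis. $\mathbf{P}_c$ is obtained from $\mathbf{P}_w$ by a rotation and a translation. So that the translation can also be written as a matrix product, the transforms below use homogeneous coordinates:
$(x,y,z,w)\equiv\left(\frac{x}{w},\frac{y}{w},\frac{z}{w}\right)$
Rotation matrices (about the $x$, $y$, and $z$ axes):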
$\mathbf{R}_x= \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & \cos\theta & -\sin\theta & 0\\ 0 & \sin\theta & \cos\theta & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}$ $\mathbf{R}_y= \begin{pmatrix} \cos\theta & 0 & \sin\theta & 0\\ 0 & 1 & 0 & 0\\ -\sin\theta & 0 & \cos\theta & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}$ $\mathbf{R}_z= \begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0\\ \sin\theta & \cos\theta & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}$

Translation matrix:
$\mathbf{T}= \begin{pmatrix} 1 & 0 & 0 & t_x\\ 0 & 1 & 0 & t_y\\ 0 & 0 & 1 & t_z\\ 0 & 0 & 0 & 1 \end{pmatrix}$
Transform from world coordinates to camera coordinates (rotate first, then translate):
$\mathbf{P}_c=\begin{pmatrix}X_c\\Y_c\\Z_c\\1\end{pmatrix}= \mathbf{T}\mathbf{R}_x\mathbf{R}_y\mathbf{R}_z\begin{pmatrix}X_w\\Y_w\\Z_w\\1\end{pmatrix}= \begin{pmatrix}\mathbf{R}_{3\times3} & \mathbf{t}_{3\times1}\\\mathbf{0}_{1\times3} & 1\end{pmatrix}_{4\times4}\mathbf{P}_w$
where $\mathbf{R}$ and $\mathbf{t}$ are the camera's extrinsic parameters:
$\mathbf{R}_{3\times3}= \begin{pmatrix} 1 & 0 & 0\\ 0 & \cos\theta_x & -\sin\theta_x\\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix} \begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y\\ 0 & 1 & 0\\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix} \begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0\\ \sin\theta_z & \cos\theta_z & 0\\ 0 & 0 & 1 \end{pmatrix}$ $\mathbf{t}_{3\times1}=\begin{pmatrix}t_x\\t_y\\t_z\end{pmatrix}$
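
The extrinsic transform can be checked numerically with a short sketch. It assumes the block form $\begin{pmatrix}\mathbf{R} & \mathbf{t}\\\mathbf{0} & 1\end{pmatrix}$ with $\mathbf{R}=\mathbf{R}_x\mathbf{R}_y\mathbf{R}_z$, i.e. the point is rotated first and then translated by $\mathbf{t}$. All function names are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def extrinsic_matrix(theta_x, theta_y, theta_z, t):
    """4x4 matrix [[R, t], [0, 1]] with R = R_x R_y R_z."""
    R = matmul(matmul(rot_x(theta_x), rot_y(theta_y)), rot_z(theta_z))
    return [R[i] + [float(t[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def world_to_camera(E, p_w):
    """Apply the extrinsic matrix to a 3D world point via homogeneous coordinates."""
    p_h = list(p_w) + [1.0]
    x, y, z, w = [sum(E[i][k] * p_h[k] for k in range(4)) for i in range(4)]
    return (x / w, y / w, z / w)
```

For example, a rotation of $90^\circ$ about the $z$ axis followed by a translation of $(1,2,3)$ maps the world point $(1,0,0)$ to $(0,1,0)+(1,2,3)=(1,3,3)$.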

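In an ideal optical system, an incoming ray parallel to the optical axis passes through the image-side focal point, and an incoming ray through the principal point keeps its direction; the intersection of these two rays gives the image position. When the object distance is large enough and depth of field is ignored, the image is assumed to be always sharp and complete, and the image distance can be approximated by the focal length. This is the pinhole approximation.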

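In computer vision, cameras are usually described by the pinhole model under this assumption. Because a pinhole camera forms an inverted image, the conjugate (virtual) image plane in front of the pinhole is used instead, which keeps the signs consistent. Image-plane coordinates $\mathbf{P}_i=(X_i,Y_i)$ are obtained from camera coordinates $\mathbf{P}_c$ by similar triangles:
$\frac{X_c}{X_i}=\frac{Y_c}{Y_i}=\frac{Z_c}{f} \Longrightarrow \begin{cases} X_i=\dfrac{X_c}{Z_c}f\\ Y_i=\dfrac{Y_c}{Z_c}f \end{cases}$
In matrix form, using homogeneous coordinates so that $(Z_cX_i,Z_cY_i,Z_c)\equiv(X_i,Y_i)$:
$\mathbf{P}_i=\begin{pmatrix}Z_cX_i\\Z_cY_i\\Z_c\end{pmatrix}= \begin{pmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix}X_c\\Y_c\\Z_c\\1\end{pmatrix}= \begin{pmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{pmatrix} \mathbf{P}_c$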
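The pixel coordinate system $\mathbf{P}_p=(u,v)$ lies in the same plane as the image-plane coordinate system; the difference is that one is continuous and the other discrete. Pixel coordinates are also bounded, and their origin is at the top-left corner of the image. To convert from image-plane coordinates (meters) to pixel coordinates (pixels), assume scale factors $\rho_u\times\rho_v$ (pixels per meter) and offsets $(c_x,c_y)$. In homogeneous form, with the actual pixel position given by $(u/w,v/w)$: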
$\begin{aligned} \mathbf{P}_p=\begin{pmatrix}u\\v\\w\end{pmatrix} &=\begin{pmatrix} \rho_u & 0 & c_x\\ 0 & \rho_v & c_y\\ 0 & 0 & 1 \end{pmatrix} \mathbf{P}_i\\ &=\begin{pmatrix} \rho_u & 0 & c_x\\ 0 & \rho_v & c_y\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{pmatrix} \mathbf{P}_c\\ &=\begin{pmatrix} f\rho_u & 0 & c_x & 0\\ 0 & f\rho_v & c_y & 0\\ 0 & 0 & 1 & 0 \end{pmatrix} \mathbf{P}_c\\ &=\begin{pmatrix}\mathbf{K}_{3\times3} & \mathbf{0}_{3\times1}\end{pmatrix}_{3\times4} \begin{pmatrix} \mathbf{R}_{3\times3} & \mathbf{t}_{3\times1}\\ \mathbf{0}_{1\times3} & \mathbf{1}_{1\times1} \end{pmatrix}_{4\times4}\mathbf{P}_w \end{aligned}$
where $\mathbf{K}$ is the camera's intrinsic matrix:
$\mathbf{K}_{3\times3}= \begin{pmatrix} f\rho_u & 0 & c_x\\ 0 & f\rho_v & c_y\\ 0 & 0 & 1 \end{pmatrix}$
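
A brief sketch of the projection from camera coordinates to pixel coordinates using $\mathbf{K}$. The function names and the numeric values in the usage note are assumptions chosen for illustration.

```python
def intrinsic_matrix(f, rho_u, rho_v, c_x, c_y):
    """Build K from the focal length, pixel densities, and principal-point offsets."""
    return [
        [f * rho_u, 0.0, c_x],
        [0.0, f * rho_v, c_y],
        [0.0, 0.0, 1.0],
    ]

def project(K, p_c):
    """Project a camera-space point with Z_c > 0 to pixel coordinates (u/w, v/w)."""
    u, v, w = [sum(K[i][k] * p_c[k] for k in range(3)) for i in range(3)]
    if w <= 0.0:
        raise ValueError("point must lie in front of the camera (Z_c > 0)")
    return (u / w, v / w)
```

With $f=0.05$ m and $\rho_u=\rho_v=20000$ pixels per meter, the focal length in pixels is $f\rho_u=1000$. A point on the optical axis projects to $(c_x,c_y)$, and moving a point twice as far away halves its offset from that center.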

References

B. Mildenhall, P. P. Srinivasan, M. Tancik, et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV’20. https://arxiv.org/abs/2003.08934
https://towardsdatascience.com/what-are-intrinsic-and-extrinsic-camera-parameters-in-computer-vision-7071b72fb8ec