GIRAFFE Paper Reading (CVPR 2021)

南陵花神

Last modified: 2022-06-24 23:21:17

Article tags: computer vision, deep learning, opencv

First published: 2022-01-18 14:41:21

Article link: https://blog.csdn.net/cpu077/article/details/122552733

Reading notes on GIRAFFE, the CVPR 2021 Best Paper

Paper link
link
Extraction code: fine

Annotated code for the Generator part of the figure (Step 1 to Step 8) is attached at the end of this article.

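Problem statement

1. Background: NeRF, GRAF, GAN

2. Open problem: GRAF builds on NeRF and drops the requirement for camera poses, and it allows simple edits to an object's shape/texture/appearance, but it cannot fully disentangle them.

3. Goal of the method: to handle more complex scenes and to allow objects to be added freely, the generated scene is extended from single-object to multi-object.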

Main pipeline of the paper, step by step

Step 1

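camera.py contains functions for the pixel, image, camera, and world matrices, and common.py implements the transformation from pixel coordinates to world coordinates, which yields the ray for each pixel. The formula is given below; once Z_c is known, it only needs to be inverted.
Figure: transformation from the pixel coordinate system to the world coordinate system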
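
The inversion can be sketched as follows, assuming the standard pinhole model Z_c · [u, v, 1]^T = K (R X_w + t). The names `pixel_to_world` and `pixel_ray`, and the example values of `K`, `R`, and `t`, are hypothetical stand-ins for illustration and are not the helpers in camera.py or common.py.

```python
import numpy as np

def pixel_to_world(u, v, z_c, K, R, t):
    """Back-project pixel (u, v) at camera depth z_c to world coordinates.

    Inverts the assumed pinhole model  z_c * [u, v, 1]^T = K @ (R @ X_w + t).
    """
    pixel_h = np.array([u, v, 1.0])
    x_cam = z_c * np.linalg.inv(K) @ pixel_h  # point in camera coordinates
    return R.T @ (x_cam - t)                  # R is orthonormal, so R^-1 = R^T

def pixel_ray(u, v, K, R, t):
    """Return the ray (origin, unit direction) through pixel (u, v)."""
    origin = -R.T @ t                         # camera center in world coordinates
    direction = pixel_to_world(u, v, 1.0, K, R, t) - origin
    return origin, direction / np.linalg.norm(direction)

# Hypothetical camera: focal length 100 px, principal point (32, 32),
# no rotation, world origin 2 units in front of the camera.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
origin, direction = pixel_ray(32.0, 32.0, K, R, t)
```

The ray through the principal point starts at the camera center and points along the optical axis; every other pixel gives a ray tilted away from it.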
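Step 2

Once the rays are available, two inputs are fed into the Neural Feature Fields. d_j is the view direction of the j-th ray, and the range of j is set by the number of image pixels: for a 4 × 4 image j runs up to 16, and in the paper it runs up to H_V × W_V. x_ij is the i-th point on the j-th ray. Finally, d_j and x_ij are each lifted to a higher-dimensional feature space.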
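
The lifting step can be sketched as below, assuming a NeRF-style sinusoidal positional encoding. The function name `positional_encoding`, the frequency counts (10 for points, 4 for directions), and the ordering of the output entries are assumptions of this sketch and may differ from the actual implementation.

```python
import math

def positional_encoding(values, n_freq):
    """Lift each scalar t to (sin(2^k*pi*t), cos(2^k*pi*t)) for k = 0..n_freq-1."""
    encoded = []
    for t in values:
        for k in range(n_freq):
            angle = (2.0 ** k) * math.pi * t
            encoded.extend([math.sin(angle), math.cos(angle)])
    return encoded

x_ij = [0.5, -0.25, 0.0]  # a 3D sample point on a ray
d_j = [0.0, 0.0, 1.0]     # a unit view direction
x_enc = positional_encoding(x_ij, n_freq=10)  # 3 * 10 * 2 values
d_enc = positional_encoding(d_j, n_freq=4)    # 3 * 4 * 2 values
```

The higher-frequency terms let an MLP represent fine detail that it would smooth over if it only saw the raw 3D coordinates.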

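Step 3

To overcome the limitations of NeRF and GRAF, GIRAFFE samples N latent vectors from a Gaussian distribution. In Z_s^N and Z_a^N, s stands for shape and a for appearance, and N is the number of entities in the scene (two cars plus the background give N = 3). T_N denotes the affine transformation applied to each object, which translates or rotates it. The function is as follows: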

def get_transformations(self, val_s=[[0.5, 0.5, 0.5]], val_t=[[0.5, 0.5, 0.5]], val_r=[0.5], batch_size=32, to_device=True):
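
A minimal sketch of what this step does, assuming the affine map k(x) = R · (s ⊙ x) + t with a rotation about a single axis. The helper names, the latent dimension 256, and the choice of the z axis are hypothetical and are not taken from the repository.

```python
import numpy as np

def sample_latents(n_entities, dim, rng):
    """Draw one shape code and one appearance code per entity from N(0, I)."""
    z_shape = rng.standard_normal((n_entities, dim))
    z_app = rng.standard_normal((n_entities, dim))
    return z_shape, z_app

def rotation_z(angle):
    """Rotation matrix about the z axis (axis choice is an assumption)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def object_to_scene(x, scale, translation, angle):
    """k(x) = R @ (s * x) + t: place an object-space point into the scene."""
    return rotation_z(angle) @ (scale * x) + translation

def scene_to_object(x, scale, translation, angle):
    """k^-1(x): map a scene point back into the object's canonical space."""
    return (rotation_z(angle).T @ (x - translation)) / scale

rng = np.random.default_rng(0)
z_shape, z_app = sample_latents(n_entities=3, dim=256, rng=rng)

scale = np.array([2.0, 2.0, 2.0])
translation = np.array([0.0, 1.0, 0.0])
angle = np.pi / 2
placed = object_to_scene(np.array([1.0, 0.0, 0.0]), scale, translation, angle)
restored = scene_to_object(placed, scale, translation, angle)
```

Evaluating each object's field in its own canonical space through the inverse map is what allows an object to be moved or rotated without changing its shape or appearance code.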

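Step 4
The code for this part is in generator.py and the functions it calls; h is an MLP. Unlike NeRF and GRAF, which produce volume density and color, GIRAFFE produces volume density and features. The output, the volume density and feature of the i-th point on the j-th ray for every object, is a very large tensor.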

sigma = F.relu(torch.stack(sigma, dim=0))
feat = torch.stack(feat, dim=0)

Step 5
The resulting tensors are passed to the composition function:

sigma_sum,feat_weighted=self.composite_function(sigma,feat)

These maps are composited either by taking the entity with the maximum sigma or by a sigma-weighted average.
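
Both composition modes can be sketched in NumPy as follows. This assumes densities of shape (N, P) and features of shape (N, P, C) for N entities and P sample points; the function names and the `eps` guard are hypothetical, and this is an illustration rather than the repository's `composite_function`.

```python
import numpy as np

def composite_mean(sigma, feat, eps=1e-4):
    """Sum the densities and average the features weighted by density."""
    sigma_sum = sigma.sum(axis=0)                     # (P,)
    denom = np.where(sigma_sum == 0, eps, sigma_sum)  # avoid division by zero
    weights = sigma / denom                           # (N, P)
    feat_weighted = (weights[..., None] * feat).sum(axis=0)  # (P, C)
    return sigma_sum, feat_weighted

def composite_max(sigma, feat):
    """Keep, at every point, the entity with the largest density."""
    idx = sigma.argmax(axis=0)                        # (P,)
    points = np.arange(sigma.shape[1])
    return sigma[idx, points], feat[idx, points]

# Two entities, two sample points, two feature channels.
sigma = np.array([[1.0, 0.0],
                  [3.0, 0.0]])
feat = np.array([[[1.0, 0.0], [5.0, 5.0]],
                 [[0.0, 1.0], [7.0, 7.0]]])
sigma_sum, feat_mean = composite_mean(sigma, feat)
sigma_max, feat_max = composite_max(sigma, feat)
```

At the first point the denser entity dominates the averaged feature, and at the second point, where nothing is present, the density stays zero so the point contributes nothing during rendering.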

Step 6
Volume rendering: the i sample points on each ray are merged, that is, the index i is integrated out, which yields an H_V × W_V feature map.

return feat_map
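
For a single ray, the merge over i can be sketched with the usual NeRF-style quadrature: alpha_i = 1 - exp(-sigma_i · delta_i), T_i = Π_{k<i} (1 - alpha_k), and the output feature is Σ_i T_i · alpha_i · f_i. The function names and sample values below are hypothetical.

```python
import math

def volume_weights(sigmas, deltas):
    """Per-sample weights w_i = T_i * alpha_i along one ray."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def render_ray(sigmas, deltas, feats):
    """Collapse the samples of one ray into a single feature vector."""
    weights = volume_weights(sigmas, deltas)
    n_channels = len(feats[0])
    return [sum(w * f[c] for w, f in zip(weights, feats))
            for c in range(n_channels)]

# Empty space, then an (almost) opaque surface, then an occluded sample.
sigmas = [0.0, 1e9, 5.0]
deltas = [0.1, 0.1, 0.1]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixel_feature = render_ray(sigmas, deltas, feats)
```

Repeating this for every ray and arranging the results on the pixel grid gives the H_V × W_V feature map.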

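Step 7
Neural rendering: a 2D CNN whose main purpose can be understood as converting the feature tensor into RGB, similar to a decoder. The paper describes the model of this CNN in detail. The corresponding class is in neural_renderer.py.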

class NeuralRenderer(nn.Module):
...
return rgb

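Step 8

The discriminator is simple: the Generator tries to fool the Discriminator as well as it can. Notably, the paper uses FID to evaluate the images generated by the GAN. FID is the distance between the Inception v3 features of the dataset images and those of the generated images.

The drawbacks of FID: Inception was trained on ImageNet, whereas FID needs rich features that ImageNet may not supply; it is slow to compute; and it summarizes each feature distribution with only two statistics, the mean and the covariance.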

def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    ...
    return (diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean)
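
A self-contained sketch of the Fréchet distance between two Gaussians, d² = ‖μ1 − μ2‖² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). To stay NumPy-only, it obtains the trace of the matrix square root from the eigenvalues of Σ1Σ2 rather than from a matrix square root routine, so it is an illustration and not the function excerpted above. The names `frechet_distance` and `gaussian_stats` and the random stand-in features are hypothetical.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # For positive semi-definite inputs, the eigenvalues of sigma1 @ sigma2 are
    # real and non-negative, and Tr((sigma1 sigma2)^(1/2)) is the sum of their
    # square roots. Clipping removes tiny negative values from round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff.dot(diff) + np.trace(sigma1)
                 + np.trace(sigma2) - 2.0 * tr_covmean)

def gaussian_stats(features):
    """Mean and covariance of a (n_samples, n_dims) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Random vectors stand in for Inception features in this sketch.
rng = np.random.default_rng(0)
real_feats = rng.standard_normal((500, 4))
mu_real, cov_real = gaussian_stats(real_feats)

same = frechet_distance(mu_real, cov_real, mu_real, cov_real)
shifted = frechet_distance(np.zeros(2), np.eye(2),
                           np.array([3.0, 4.0]), np.eye(2))
```

The distance is zero for identical statistics and grows with both the gap between the means and the mismatch between the covariances, which is why only these two statistics enter the score.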

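Reproducing the code

1. Rendering with the pretrained model from the paper gives the following results (only the effect of editing the shape code is shown):
link
2. Training the model on our own dataset gives the following results (only the effect of editing the rotation is shown):
link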

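Limitations of the results

1. With face-like datasets, "Dataset Bias" appears. As the figure shows, the gaze of every face in the dataset points toward the camera, so the gaze of the generated faces also always points toward the camera, which contradicts the required view consistency.

2. Disentanglement of multiple objects is imperfect: the model's disentangling ability is not strong enough, and problems such as foreground objects being "attached" to the background occur.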

Code annotations

# Volume rendering, where the magic happens
    def volume_render_image(self, latent_codes, camera_matrices,
                            transformations, bg_rotation, mode='training',
                            it=0, return_alpha_map=False,
                            not_render_background=False,
                            only_render_background=False):
        res = self.resolution_vol
        device = self.device
        n_steps = self.n_ray_samples
        n_points = res * res
        depth_range = self.depth_range
        batch_size = latent_codes[0].shape[0]
        z_shape_obj, z_app_obj, z_shape_bg, z_app_bg = latent_codes
        assert(not (not_render_background and only_render_background))

        # Arange Pixels
        pixels = arange_pixels((res, res), batch_size,
                               invert_y_axis=False)[1].to(device)
        pixels[..., -1] *= -1.
        # Project to 3D world
        pixels_world = image_points_to_world(
            pixels, camera_mat=camera_matrices[0],
            world_mat=camera_matrices[1])
        camera_world = origin_to_world(
            n_points, camera_mat=camera_matrices[0],
            world_mat=camera_matrices[1])
        ray_vector = pixels_world - camera_world
        # batch_size x n_points x n_steps
        di = depth_range[0] + \
            torch.linspace(0., 1., steps=n_steps).reshape(1, 1, -1) * (
                depth_range[1] - depth_range[0])
        di = di.repeat(batch_size, n_points, 1).to(device)
        if mode == 'training':
            di = self.add_noise_to_interval(di)

        n_boxes = latent_codes[0].shape[1]
        feat, sigma = [], []
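        # This is where the decoder's forward pass maps the 3D points and camera
        # viewing directions to the sigma and RGB (feature) values of each entity.
        # A separate generator is applied to the background.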
        n_iter = n_boxes if not_render_background else n_boxes + 1
        if only_render_background:
            n_iter = 1
            n_boxes = 0
        for i in range(n_iter):
            if i < n_boxes:  # Object
                p_i, r_i = self.get_evaluation_points(
                    pixels_world, camera_world, di, transformations, i)
                z_shape_i, z_app_i = z_shape_obj[:, i], z_app_obj[:, i]

                feat_i, sigma_i = self.decoder(p_i, r_i, z_shape_i, z_app_i)

                if mode == 'training':
                    # As done in NeRF, add noise during training
                    sigma_i += torch.randn_like(sigma_i)

                # Mask out values outside
                padd = 0.1
                mask_box = torch.all(
                    p_i <= 1. + padd, dim=-1) & torch.all(
                        p_i >= -1. - padd, dim=-1)
                sigma_i[mask_box == 0] = 0.

                # Reshape
                sigma_i = sigma_i.reshape(batch_size, n_points, n_steps)
                feat_i = feat_i.reshape(batch_size, n_points, n_steps, -1)
            else:  # Background
                p_bg, r_bg = self.get_evaluation_points_bg(
                    pixels_world, camera_world, di, bg_rotation)

                feat_i, sigma_i = self.background_generator(
                    p_bg, r_bg, z_shape_bg, z_app_bg)
                sigma_i = sigma_i.reshape(batch_size, n_points, n_steps)
                feat_i = feat_i.reshape(batch_size, n_points, n_steps, -1)

                if mode == 'training':
                    # As done in NeRF, add noise during training
                    sigma_i += torch.randn_like(sigma_i)

            feat.append(feat_i)
            sigma.append(sigma_i)
        sigma = F.relu(torch.stack(sigma, dim=0))
        feat = torch.stack(feat, dim=0)

        if self.sample_object_existance:
            object_existance = self.get_object_existance(n_boxes, batch_size)
            # add ones for bg
            object_existance = np.concatenate(
                [object_existance, np.ones_like(
                    object_existance[..., :1])], axis=-1)
            object_existance = object_existance.transpose(1, 0)
            sigma_shape = sigma.shape
            sigma = sigma.reshape(sigma_shape[0] * sigma_shape[1], -1)
            object_existance = torch.from_numpy(object_existance).reshape(-1)
            # set alpha to 0 for respective objects
            sigma[object_existance == 0] = 0.
            sigma = sigma.reshape(*sigma_shape)

        # Composite
        # Use the composition function to merge these maps via sigma max or averaging
        sigma_sum, feat_weighted = self.composite_function(sigma, feat)

        # Get Volume Weights
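        # Finally, create the final image by weighting the feature maps with the
        # sigma-derived volume weights along the ray vectors. The result is a single
        # frame of a single window of one of the animations shown
        # (see generator.py for details on how di and ray_vector are constructed).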
        weights = self.calc_volume_weights(di, ray_vector, sigma_sum)
        feat_map = torch.sum(weights.unsqueeze(-1) * feat_weighted, dim=-2)

        # Reformat output
        feat_map = feat_map.permute(0, 2, 1).reshape(
            batch_size, -1, res, res)  # B x feat x h x w
        feat_map = feat_map.permute(0, 1, 3, 2)  # new to flip x/y
        if return_alpha_map:
            n_maps = sigma.shape[0]
            acc_maps = []
            for i in range(n_maps - 1):
                sigma_obj_sum = torch.sum(sigma[i:i+1], dim=0)
                weights_obj = self.calc_volume_weights(
                    di, ray_vector, sigma_obj_sum, last_dist=0.)
                acc_map = torch.sum(weights_obj, dim=-1, keepdim=True)
                acc_map = acc_map.permute(0, 2, 1).reshape(
                    batch_size, -1, res, res)
                acc_map = acc_map.permute(0, 1, 3, 2)
                acc_maps.append(acc_map)
            acc_map = torch.cat(acc_maps, dim=1)
            return feat_map, acc_map
        else:
            return feat_map