2D Gaussian Splatting 文章+代码串读（无敌详细的疯狂解读）

最新推荐文章于 2025-05-20 08:09:36 发布

自动驾驶之心

最新推荐文章于 2025-05-20 08:09:36 发布

阅读量5.6k

点赞数 19

原文链接：https://mp.weixin.qq.com/s?__biz=Mzg2NzUxNTU1OA==&mid=2247619157&idx=3&sn=391b693d1d8aaa2173dbb894e736a302&chksm=cf8d56afd756151226ba027781e96037f35599c76e06757a9fbc06fdb4dfd146afcc157d9640&scene=126&sessionid=0

版权

作者 | Yuhan Chen 编辑 | 自动驾驶之心

原文链接：https://zhuanlan.zhihu.com/p/708372232

点击下方卡片，关注“自动驾驶之心”公众号

戳我-> 领取自动驾驶近15个方向学习路线

>>点击进入→自动驾驶之心『三维重建』技术交流群

本文只做学术分享，如有侵权，联系删文

开门见山

首先开门见山，为什么我非常关注2DGS而不是3DGS。3D Gaussian Splatting作为去年年末至今最火爆的方向之一，诞生了非常多好的作品。但是，在我尝试了非常多的代码后。我发现，最好的效果，或者换句话说，转Mesh之后，最好的效果，2DGS几乎是完胜的。

首先我们来看一下具体的一个展示效果：

3DGS 使用元象UE5插件自带的3DGS构建工具

2DGS训练

2DGS的效果非常好，并且转mesh的质量也非常高。因此我自认为这是一个当前对于自动驾驶中，街景重建的一个很好的解决方案。因此，我花了大概3天的时间，梳理了一下整体的代码。和大家一起，串读一下整篇文章。

2DGS的文章链接：
arxiv.org/pdf/2403.17888

2DGS的代码链接：
github.com/hbb1/2d-gaussian-splatting

INTRODUCTION

首先，引言中主要讲的内容，是新颖视图合成的工作，以及致敬3DGS的贡献。同时暗喻自己的观点，就是表面元素基元是复杂几何的有效表达（但碍于时代无法落地）2DGS呢主要就是结合了3DGS和表面元素基元的优势（将高斯椭球定义为椭圆盘）：1. 使用2D椭球盘用来进行表面建模（Gaussian model部分）；2. 使用Ray-Split和体积积分来实现透视正确的Splatting （Cuda部分）；3. 引入了两种新的损失改进表面重建。

2DGS和3DGS的价值评估不同；3DGS利用交叉平面，而2DGS利用多视图一致性

RELATED WORK

接下来就是相关工作。主要讲俩事儿，第一个，致敬新颖视图合成，致敬3D Gaussian Splatting及其后续开发的工作。第二个，回顾可微点渲染的进展。那么对于可微点渲染，这里推荐两篇可以看看：

Perspective Accurate Splatting
graphics.stanford.edu/~mapauly/Pdfs/perspective_accurate_splatting.pdf

Differentiable Surface Splatting for Point-based Geometry Processing
arxiv.org/abs/1906.04173

3D Gaussian Splatting

回顾了3DGS的基础知识。这部分我比较推荐看一下这两篇文章：

holk：一文带你入门 3D Gaussian Splatting

八氨合氯化钙：3D Gaussian Splatting中的数学推导

这俩肯定比2DGS那一小段讲得清楚哈。由于3DGS的原文相对还是对初学者不太友好。我建议可以粗略看看上面两篇文章即可。在阐述完3DGS的基本原理后，2DGS其实还是针对表面重建的问题，对3DGS进行了一手拷打：1. 体积辐射率与表面的薄特性相冲突；2. 3DGS没有对法线进行建模；3. 3DGS的光栅化过程缺乏多视图一致性，导致了不同相机的2D相交平面不同，因此产生了我们开头提到的噪声（模糊杂块）

2D Gaussian Splatting

（Part 1）

首先，我们来看一下2DGS的整体模型。2DGS整体采用“平坦2D椭圆盘”来表示稀疏点云，这种表面基元将密度分布到平面圆盘中。将法线定义为密度变化最陡的方向（云里雾里的。。），为了与薄表面对齐。

2DGS的示意图

2DGS的表达其实很简单哈，如上面的图所示，一个中心点固定中心点，然后使用了两个切向量和来确定平面。然后呢，还定义了一个=x是这个平面的法向量。同时呢，对于这个2DGS呢，肯定还有大小哈，来控制它的致密度。所以就设置了两个缩放因子（椭圆嘛，是不是这样理解）。用了两个缩放因子和来表示。

细讲1

那么，如何实现3DGS椭球的缩放和旋转呢？

缩放：缩放就非常简单了哈，只需要直接调整对角元素就行。越大的方差值表明该轴的分布更广，即这个面更长。

协方差矩阵与旋转矩阵：参考内容：

翻译：协方差矩阵的几何解释
njuferret.github.io/2019/07/28/2019-07-28_geometric-interpretation-covariance-matrix/

（Part 2）

OK，我们继续回归到2DGS的本身。在我们知道了2DGS的表示方式之后，我们需要构建2DGS的表达。首先，构建旋转矩阵R。同样的，它是一个3X3的矩阵。我们使用两个同一平面向量以及法向量构成新的旋转矩阵：R=(,,),同时，为了方面计算，将缩放因子，也搞成一个缩放矩阵S=(,,0)，将最后一项置0。

对于上述空间中的旋转矩阵以及其计算，我推荐以下参考：

https://blog.csdn.net/KYJL888/article/details/91432684
blog.csdn.net/KYJL888/article/details/91432684

三维空间刚体运动2：旋转向量与罗德里格斯公式（最详细推导）_空间刚体位移计算公式-CSDN博客
blog.csdn.net/shao918516/article/details/105109278

OK，因此呢，我们便可以求出2DGS局部的切平面，在世界空间中的定义，并将其参数化：

2DGS原文定义

其中，H是一个4X4的矩阵。表示2DGS的齐次变换（也就是2DGS的几何形状）（这里使用的是uv坐标系而不是xyz坐标系，更多是方便大家理解。对于局部平面的一个思想），对于整个2DGS局部切平面中的一点(u,v)，我们此时理解为高斯函数的均值为0，方差为1，就可以通过下面这个公式（标准的高斯）去计算此时u的函数值：

二维高斯

然后呢，我们就可以利用矩阵H讲局部变换点u变换到全局坐标系中了。

作者还讲了下对于上面提到的学习参数有哪些呢？对,中心点可学习（初始化是使用的Colmap提供的点），(,)和(,)都是可学习的，同时呢，还有高斯圆盘的透明度以及外观c。

细讲2

这里我们需要讲一下3DGS的情况。方便大家理解。首先我们需要了解下球谐函数。球谐函数可以类比泰勒展开或傅里叶级数，都是不同价的基函数线性组合而来。只不过傅里叶变换，是三角基；小波变换是小波基；OK，球谐函数是球基（球函数）。球谐函数，我推荐下面这个博客，讲得还不错：

浦夜：球谐函数介绍（Spherical Harmonics）

那么它本质上和傅里叶函数一样，我们认为阶数越多，拟合效果就越好。（当然也可能过拟合）

球谐函数有啥性质呢，为啥我们想用球谐函数作为图像渲染的手段呢？球谐函数主要有俩性质：1.正交性：基函数之间是线性独立的；2.旋转不变性：改变光照后，可以推导新的光照结果。

我们认为，空间中的每一个单一点带一组基函数，就可以描述空间中的光照（应用中一般只用3-4阶）。具体的球谐光照，大家可以参考以下文章，也是3DGS论文中所引用的文章：

Monica的小甜甜：【论文复现】Spherical Harmonic Lighting:The Gritty Details

Monica的小甜甜：【论文复现】An Efficient Representation for Irradiance Environment Maps

（Part 3）

接下来呢，就是非常给劲儿的抛射环节。这个部分与3DGS也有非常大的区别。具体体现，可以查看Cuda的Forward.h部分。我们还是先讲paper。首先作者blablabla说了一堆，大概意思是说，它使用了齐次坐标的计算公式，将传统的2D基元投影问题，搞成了2D-2D的映射方式。

公式

转uv坐标系

定义

定义式

（Part 4）

接下来，作者开始对2DGS的分布映射进行了一些优化。因为2DGS的分布，是一个圆盘，当你侧面看的时候，会变成一条线，在Cuda光栅化的时候，可能会把它给弄没了。所以作者搞了一个低通滤波器来处理这个问题。

作者的参考文献：

High-quality surface splatting on today's GPUs
ieeexplore.ieee.org/document/1500313

参考的低通滤波器公式

（Part 5）

接下来，就是光栅化问题，2DGS的光栅化与3DGS一模一样。首先为每个2DGS计算空间边界框，然后2DGS根据每个中心的深度进行排序，并根据深度框组织图块。最后，体积累积得到最终的color。

光栅化过程

细讲3

首先3DGS也是参考的下面这篇文献做的工作：

EWA splatting | IEEE Journals & Magazine | IEEE Xplore
ieeexplore.ieee.org/document/1021576

首先，在确定了相机位姿以后，在空间中确立相机此时的视锥体。然后，需要将3D的空间中的3D椭球投影到2D空间中进行渲染。对于3DGS来说，这一步就像雪秋啪的一声撩墙上了，所以叫Splatting。这里涉及到从3D的协方差矩阵，换到2D的协方差矩阵，其可以表示为：

这一部分呢出自我们刚刚推荐过的文章：

八氨合氯化钙：3D Gaussian Splatting中的数学推导

对整个3DGS的像素级的渲染流程，我认为下面这篇文章讲的很好：

A Survey on 3D Gaussian Splatting
arxiv.org/abs/2401.03890

像素级讲渲染过程

（Part 6）

对于2DGS的参数更新及致密化，前期是与3DGS几乎是一模一样的。这里我们直接推荐两篇博客给大家看。这玩意儿，大家看到相应的代码就懂了。非常简单哈：

3DGS学习（六）-- 参数更新
blog.csdn.net/m0_63843473/article/details/136285914?spm=1001.2014.3001.5502

3DGS学习（七）-- 自适应高斯密度控制
blog.csdn.net/m0_63843473/article/details/136286811?spm=1001.2014.3001.5502

这一部分2DGS的作者并没有去细讲，但是其逻辑与3DGS是一样的。

（Part 7）

损失函数，我不是特别想讲。损失函数并不是本文的重点有一说一。总的来说就是2DGS加入了两个新的约束，用来对深度和法线进行监督。

深度畸变损失

法线损失

总损失函数

OK。到这里，整体的2DGS，我们基本就讲完了。

接下来，我们来看看代码：

代码详解

对于代码来说，我提供几乎全文详尽注释的2DGS代码。在本文章中，我们主要关注非常重要的代码即可。

首先，对于所有的开源代码而言，最重要的就是Train.py这个文件。Train几乎囊括了整个2DGS的构建过程，因此这十分重要。首先，我们来看一下Train.py中的执行部分：

if __name__ == "__main__":

    # 创建解析器，初始化操作
    parser = ArgumentParser(description="Training script parameters")
    # 设置模型参数，并使用parser外部数据进行添加
    # 本身模型属性可修改内容：sh_degree，source_path，model_path，image属性，resolution，white_background，data_device，eval，render_items
    lp = ModelParams(parser)
    # 精细化调参，非常多
    op = OptimizationParams(parser)
    # convert_SHs_python, compute_cov3D_python, depth_ratio, debug四个参数
    pp = PipelineParams(parser)
    # parser的额外参数，与上述参数相同
    parser.add_argument('--ip', type=str, default="127.0.0.1")
    parser.add_argument('--port', type=int, default=6009)
    parser.add_argument('--detect_anomaly', action='store_true', default=False)
    parser.add_argument("--test_iterations", nargs="+", type=int, default=[7_000, 30_000])
    parser.add_argument("--save_iterations", nargs="+", type=int, default=[7_000, 30_000])
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--checkpoint_iterations", nargs="+", type=int, default=[])
    parser.add_argument("--start_checkpoint", type=str, default = None)
    # 读取参数
    args = parser.parse_args(sys.argv[1:])
    args.save_iterations.append(args.iterations)
    
    print("Optimizing " + args.model_path)

    # 初始化系统状态
    # 这段代码的目的是为了在执行过程中控制标准输出的行为，添加时间戳并在需要时禁止输出，以便在某些场景下更方便地进行调试和记录。
    safe_state(args.quiet)

    # 然后就是启动GUI以及运行训练的代码
    # 这行代码初始化一个 GUI 服务器，使用 args.ip 和 args.port 作为参数。这可能是一个用于监视和控制训练过程的图形用户界面的一部分。
    network_gui.init(args.ip, args.port)
    # 这行代码设置 PyTorch 是否要检测梯度计算中的异常。
    torch.autograd.set_detect_anomaly(args.detect_anomaly)
    # 输入的参数包括：模型的参数（数据集的位置）、优化器的参数、其他pipeline的参数，测试迭代次数、保存迭代次数 、检查点迭代次数 、开始检查点 、调试起点
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)

    # All done
    print("\nTraining complete.")

对于整个执行部分，前半段全部都是对参数的提取。在2DGS的代码里，主要有三类参数。第一类参数就是模型的参数，具体看arguments/init.py中的ModelParams：

class ModelParams(ParamGroup): 
    def __init__(self, parser, sentinel=False):
        self.sh_degree = 3
        self._source_path = ""
        self._model_path = ""
        self._images = "images"
        self._resolution = -1
        self._white_background = False
        self.data_device = "cuda"
        self.eval = False
        self.render_items = ['RGB', 'Alpha', 'Normal', 'Depth', 'Edge', 'Curvature']
        super().__init__(parser, "Loading Parameters", sentinel)

    def extract(self, args):
        g = super().extract(args)
        g.source_path = os.path.abspath(g.source_path)
        return g

Ok，非常easy哈。基本定义完了。第二类就是PipelineParams，还是在同一个文件里：

class PipelineParams(ParamGroup):
    def __init__(self, parser):
        self.convert_SHs_python = False
        self.compute_cov3D_python = False
        self.depth_ratio = 0.0
        self.debug = False
        super().__init__(parser, "Pipeline Parameters")

第三个就是OptimizationParams，一些调参啥的：

class OptimizationParams(ParamGroup):
    def __init__(self, parser):
        self.iterations = 30_000
        self.position_lr_init = 0.00016
        self.position_lr_final = 0.0000016
        self.position_lr_delay_mult = 0.01
        self.position_lr_max_steps = 30_000
        self.feature_lr = 0.0025
        self.opacity_lr = 0.05
        self.scaling_lr = 0.005
        self.rotation_lr = 0.001
        self.percent_dense = 0.01
        self.lambda_dssim = 0.2
        self.lambda_dist = 0.0
        self.lambda_normal = 0.05
        self.opacity_cull = 0.05

        self.densification_interval = 100
        self.opacity_reset_interval = 3000
        self.densify_from_iter = 500
        self.densify_until_iter = 15_000
        self.densify_grad_threshold = 0.0002
        super().__init__(parser, "Optimization Parameters")

接下来，在获取了参数后，整个执行部分，最重要的就是执行了Training这个函数，Training整个函数呢输入的参数包括：模型的参数（数据集的位置）、优化器的参数、其他pipeline的参数，测试迭代次数、保存迭代次数、检查点迭代次数、开始检查点、调试起点：

def training(dataset, opt, pipe, testing_iterations, saving_iterations, checkpoint_iterations, checkpoint):
    # 初始化迭代次数
    first_iter = 0
    # 设置 TensorBoard 写入器和日志记录器。
    tb_writer = prepare_output_and_logger(dataset)
    # -------------------------------------------------------------------------
    #（重点看，需要转跳）创建一个 GaussianModel 类的实例，输入一系列参数，其参数取自数据集。
    # -------------------------------------------------------------------------
    gaussians = GaussianModel(dataset.sh_degree)
    #（这个类的主要目的是处理场景的初始化、保存和获取相机信息等任务，）
    # 创建一个 Scene 类的实例，使用数据集和之前创建的 GaussianModel 实例作为参数。
    # 这里非常重要，此时已经初始化了整个高斯的点云。
    scene = Scene(dataset, gaussians)
    # 设置 GaussianModel 的训练参数
    gaussians.training_setup(opt)
    # if有预训练模型
    if checkpoint:
        # 通过 torch.load(checkpoint) 加载检查点的模型参数和起始迭代次数
        (model_params, first_iter) = torch.load(checkpoint)
        # 通过 gaussians.restore 恢复模型的状态。
        gaussians.restore(model_params, opt)
    # 设置背景颜色，根据数据集是否有白色背景来选择。
    bg_color = [1, 1, 1] if dataset.white_background else [0, 0, 0]
    # 将背景颜色转化为 PyTorch Tensor，并移到 GPU 上。
    background = torch.tensor(bg_color, dtype=torch.float32, device="cuda")

    # 创建两个 CUDA 事件，用于测量迭代时间。
    iter_start = torch.cuda.Event(enable_timing = True)
    iter_end = torch.cuda.Event(enable_timing = True)


    viewpoint_stack = None
    ema_loss_for_log = 0.0
    ema_dist_for_log = 0.0
    ema_normal_for_log = 0.0

    # 创建一个 tqdm 进度条，用于显示训练进度。
    progress_bar = tqdm(range(first_iter, opt.iterations), desc="Training progress")
    first_iter += 1

    #  接下来开始循环迭代 -------------------------------------------------------------------------------------------------
    for iteration in range(first_iter, opt.iterations + 1):        

        # 用于测量迭代时间。
        iter_start.record()
        # 新学习率
        gaussians.update_learning_rate(iteration)

        # 每 1000 次迭代，增加球谐函数的阶数
        if iteration % 1000 == 0:
            gaussians.oneupSHdegree()

        # 随机选择一个训练相机
        if not viewpoint_stack:
            viewpoint_stack = scene.getTrainCameras().copy()
        viewpoint_cam = viewpoint_stack.pop(randint(0, len(viewpoint_stack)-1))

        render_pkg = render(viewpoint_cam, gaussians, pipe, background)
        image, viewspace_point_tensor, visibility_filter, radii = render_pkg["render"], render_pkg["viewspace_points"], render_pkg["visibility_filter"], render_pkg["radii"]

        # Loss Function 损失函数 写得挺乱 给拆开了
        gt_image = viewpoint_cam.original_image.cuda()
        Ll1 = l1_loss(image, gt_image)
        loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
        
        # 正则化
        # 正态分布
        lambda_normal = opt.lambda_normal if iteration > 7000 else 0.0
        lambda_dist = opt.lambda_dist if iteration > 3000 else 0.0
        rend_dist = render_pkg["rend_dist"]
        rend_normal  = render_pkg['rend_normal']
        surf_normal = render_pkg['surf_normal']
        normal_error = (1 - (rend_normal * surf_normal).sum(dim=0))[None]
        normal_loss = lambda_normal * (normal_error).mean()
        dist_loss = lambda_dist * (rend_dist).mean()

        # 总损失
        total_loss = loss + dist_loss + normal_loss
        # 反向传播
        total_loss.backward()
        # 测量总迭代时间
        iter_end.record()
        # 记录损失的指数移动平均值，并定期更新进度条
        with torch.no_grad():
            # Progress bar
            ema_loss_for_log = 0.4 * loss.item() + 0.6 * ema_loss_for_log
            ema_dist_for_log = 0.4 * dist_loss.item() + 0.6 * ema_dist_for_log
            ema_normal_for_log = 0.4 * normal_loss.item() + 0.6 * ema_normal_for_log


            if iteration % 10 == 0:
                loss_dict = {
                    "Loss": f"{ema_loss_for_log:.{5}f}",
                    "distort": f"{ema_dist_for_log:.{5}f}",
                    "normal": f"{ema_normal_for_log:.{5}f}",
                    "Points": f"{len(gaussians.get_xyz)}"
                }

                progress_bar.set_postfix(loss_dict)

                progress_bar.update(10)
            if iteration == opt.iterations:
                progress_bar.close()

            # 将 L1 loss、总体 loss 和迭代时间写入 TensorBoard。
            if tb_writer is not None:
                tb_writer.add_scalar('train_loss_patches/dist_loss', ema_dist_for_log, iteration)
                tb_writer.add_scalar('train_loss_patches/normal_loss', ema_normal_for_log, iteration)

            training_report(tb_writer, iteration, Ll1, loss, l1_loss, iter_start.elapsed_time(iter_end), testing_iterations, scene, render, (pipe, background))
            # 如果达到保存迭代次数，保存场景
            if (iteration in saving_iterations):
                print("\n[ITER {}] Saving Gaussians".format(iteration))
                scene.save(iteration)

            # 在一定的迭代次数内进行密集化处理
            if iteration < opt.densify_until_iter:
                # 将每个像素位置上的最大半径记录在 max_radii2D 中。这是为了密集化时进行修剪（pruning）操作时的参考。
                gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter], radii[visibility_filter])
                # 将与密集化相关的统计信息添加到 gaussians 模型中，包括视图空间点和可见性过滤器
                gaussians.add_densification_stats(viewspace_point_tensor, visibility_filter)
                # 在指定的迭代次数之后，每隔一定的迭代间隔进行以下密集化操作
                # 大于500次的时候，并且除以100余0
                if iteration > opt.densify_from_iter and iteration % opt.densification_interval == 0:
                    # 据当前迭代次数设置密集化的阈值。如果当前迭代次数大于 opt.opacity_reset_interval，则设置 size_threshold 为 20，否则为 None
                    size_threshold = 20 if iteration > opt.opacity_reset_interval else None
                    # 执行密集化和修剪操作，其中包括梯度阈值、密集化阈值、相机范围和之前计算的 size_threshold。
                    gaussians.densify_and_prune(opt.densify_grad_threshold, opt.opacity_cull, scene.cameras_extent, size_threshold)
                # 在每隔一定迭代次数或在白色背景数据集上的指定迭代次数时，执行以下操作。
                if iteration % opt.opacity_reset_interval == 0 or (dataset.white_background and iteration == opt.densify_from_iter):
                    # 重置模型中的某些参数，涉及到透明度的操作，具体实现可以在 reset_opacity 方法中找到。
                    gaussians.reset_opacity()

            # 执行优化器的步骤，然后清零梯度。
            if iteration < opt.iterations:
                gaussians.optimizer.step()
                gaussians.optimizer.zero_grad(set_to_none = True)

            # 如果达到检查点迭代次数，保存检查点
            if (iteration in checkpoint_iterations):
                print("\n[ITER {}] Saving Checkpoint".format(iteration))
                torch.save((gaussians.capture(), iteration), scene.model_path + "/chkpnt" + str(iteration) + ".pth")



        with torch.no_grad():        
            if network_gui.conn == None:
                network_gui.try_connect(dataset.render_items)
            while network_gui.conn != None:
                try:
                    net_image_bytes = None
                    custom_cam, do_training, keep_alive, scaling_modifer, render_mode = network_gui.receive()
                    if custom_cam != None:
                        render_pkg = render(custom_cam, gaussians, pipe, background, scaling_modifer)   
                        net_image = render_net_image(render_pkg, dataset.render_items, render_mode, custom_cam)
                        net_image_bytes = memoryview((torch.clamp(net_image, min=0, max=1.0) * 255).byte().permute(1, 2, 0).contiguous().cpu().numpy())
                    metrics_dict = {
                        "#": gaussians.get_opacity.shape[0],
                        "loss": ema_loss_for_log
                        # Add more metrics as needed
                    }
                    # Send the data
                    network_gui.send(net_image_bytes, dataset.source_path, metrics_dict)
                    if do_training and ((iteration < int(opt.iterations)) or not keep_alive):
                        break
                except Exception as e:
                    # raise e
                    network_gui.conn = None

对于Training函数，我们就需要非常认真地逐行阅读了。

# 初始化迭代次数
first_iter = 0
# 设置 TensorBoard 写入器和日志记录器。
tb_writer = prepare_output_and_logger(dataset)

首先前两行不用细看。是一些基本参数的设置。然后就是这个函数：

gaussians = GaussianModel(dataset.sh_degree)

这个函数进行了一个Gaussian model的初始化。在scene/gaussian_model.py中定义了这个函数的初始化：

class GaussianModel:


    # 用于设置一些激活函数和变换函数
    def setup_functions(self):
        # 构建协方差矩阵，该函数接受 scaling（尺度）、scaling_modifier（尺度修正因子）、rotation（旋转）作为参数
        # 与原文一致
        """这个地方与3DGS不同"""
        def build_covariance_from_scaling_rotation(center, scaling, scaling_modifier, rotation):

            RS = build_scaling_rotation(torch.cat([scaling * scaling_modifier, torch.ones_like(scaling)], dim=-1), rotation).permute(0,2,1)
            trans = torch.zeros((center.shape[0], 4, 4), dtype=torch.float, device="cuda")
            trans[:,:3,:3] = RS
            trans[:, 3,:3] = center
            trans[:, 3, 3] = 1
            return trans

        # 将尺度激活函数设置为指数函数
        # 原因可能： 缩放必须是正数，而指数函数的返回值一定是正数。
        self.scaling_activation = torch.exp
        # 将尺度逆激活函数设置为对数函数
        self.scaling_inverse_activation = torch.log
        # 将协方差激活函数设置为上述定义的 build_covariance_from_scaling_rotation 函数。
        self.covariance_activation = build_covariance_from_scaling_rotation
        # 将不透明度激活函数设置为 sigmoid 函数，保证（0，1）
        self.opacity_activation = torch.sigmoid
        # 将不透明度逆激活函数设置为一个名为 inverse_sigmoid 的函数
        self.inverse_opacity_activation = inverse_sigmoid
        # 用于归一化旋转矩阵
        self.rotation_activation = torch.nn.functional.normalize


    def __init__(self, sh_degree : int):
        # 球谐阶数
        self.active_sh_degree = 0
        # 最大球谐阶数
        self.max_sh_degree = sh_degree
        # 存储不同信息的张量（tensor）-------------
        # 空间位置
        self._xyz = torch.empty(0)
        self._features_dc = torch.empty(0)
        self._features_rest = torch.empty(0)
        # 椭球形状尺度
        self._scaling = torch.empty(0)
        # 椭球的旋转
        self._rotation = torch.empty(0)
        # 不透明度
        self._opacity = torch.empty(0)
        self.max_radii2D = torch.empty(0)
        self.xyz_gradient_accum = torch.empty(0)
        self.denom = torch.empty(0)
        # 初始化优化器
        self.optimizer = None
        # 初始化百分比密度
        self.percent_dense = 0
        # 初始化空间学习率
        self.spatial_lr_scale = 0
        # 调用setup_functions() 设置各种激活函数和变换函数
        self.setup_functions()

我们这里只看初始化的部分。这里只输入了一个参数，就是球谐函数的阶数，这里设置的是3。在setup_functions这个函数中，设置了协方差矩阵。这里与文中是一样的，求出了RS之类的数据。并且设置了很多激活函数。具体的可以看我的注释。

接下来呢，就进入了本文的重中之重的第一个环节：

scene = Scene(dataset, gaussians)

我们找到scene/init.py这个文件，它定义了整个场景的初始化，也就是初始化了整个2DGS：

class Scene:

    gaussians : GaussianModel

    def __init__(self, args : ModelParams, gaussians : GaussianModel, load_iteration=None, shuffle=True, resolution_scales=[1.0]):
        """b
        :param path: Path to colmap scene main folder.
        """
        self.model_path = args.model_path
        self.loaded_iter = None
        self.gaussians = gaussians

        # 初始化时，未执行
        # 首先，如果load_iteration参数不是None，Scene.__init__会在输出文件夹下的point_cloud/文件夹搜索迭代次数最大的iteration_xxx文件夹
        # （例如有iteration_7000和iteration_30000的文件夹则选取后者），将最大的迭代次数记录到self.loaded_iter

        if load_iteration:
            if load_iteration == -1:
                self.loaded_iter = searchForMaxIteration(os.path.join(self.model_path, "point_cloud"))
            else:
                self.loaded_iter = load_iteration
            print("Loading trained model at iteration {}".format(self.loaded_iter))

        self.train_cameras = {}
        self.test_cameras = {}

        # scene_info的信息有如下：
        # point_cloud=pcd,
        # train_cameras=train_cam_infos,
        # test_cameras=test_cam_infos,
        # nerf_normalization=nerf_normalization,
        # ply_path=ply_path)

        # 这里是判断读取的Colmap还是Blender
        if os.path.exists(os.path.join(args.source_path, "sparse")):
            scene_info = sceneLoadTypeCallbacks["Colmap"](args.source_path, args.images, args.eval)
        elif os.path.exists(os.path.join(args.source_path, "transforms_train.json")):
            print("Found transforms_train.json file, assuming Blender data set!")
            scene_info = sceneLoadTypeCallbacks["Blender"](args.source_path, args.white_background, args.eval)
        else:
            assert False, "Could not recognize scene type!"


        # 这里是None
        # 先进入这个环节，填充相机COLMAP的所有内容
        if not self.loaded_iter:

            with open(scene_info.ply_path, 'rb') as src_file, open(os.path.join(self.model_path, "input.ply") , 'wb') as dest_file:
                dest_file.write(src_file.read())
            json_cams = []
            camlist = []
            if scene_info.test_cameras:
                camlist.extend(scene_info.test_cameras)
            if scene_info.train_cameras:
                camlist.extend(scene_info.train_cameras)
            for id, cam in enumerate(camlist):
                json_cams.append(camera_to_JSON(id, cam))
            with open(os.path.join(self.model_path, "cameras.json"), 'w') as file:
                json.dump(json_cams, file)

        # 执行了随机相片的步骤。在3DGS内需要执行，但是对于4DGS可以不用
        if shuffle:
            random.shuffle(scene_info.train_cameras)  # Multi-res consistent random shuffling
            random.shuffle(scene_info.test_cameras)  # Multi-res consistent random shuffling

        # getNerfppNorm读取所有相机的中心点位置到最远camera的距离 * 1.1
        self.cameras_extent = scene_info.nerf_normalization["radius"]


        # 用【】填充一个camera_list
        for resolution_scale in resolution_scales:
            print("Loading Training Cameras")
            self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
            print("Loading Test Cameras")
            self.test_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.test_cameras, resolution_scale, args)


        # 没有执行
        # 如果self.loaded_iter有值，则直接读取对应的（已经迭代出来的）场景
        if self.loaded_iter:
            self.gaussians.load_ply(os.path.join(self.model_path,
                                                           "point_cloud",
                                                           "iteration_" + str(self.loaded_iter),
                                                           "point_cloud.ply"))
        else:
        #执行的这里，点云高斯化
        #给所有目前采集到的点云做初始值，也就是COLMAP的值直接就进来
        #俩参数，一个是点云，一个是所有相机的中心点位置到最远camera的距离 * 1.1
            self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)

    def save(self, iteration):
        point_cloud_path = os.path.join(self.model_path, "point_cloud/iteration_{}".format(iteration))
        self.gaussians.save_ply(os.path.join(point_cloud_path, "point_cloud.ply"))

    def getTrainCameras(self, scale=1.0):
        return self.train_cameras[scale]

    def getTestCameras(self, scale=1.0):
        return self.test_cameras[scale]

这里通过读取了Colmap的点云、相机等等信息，初始化了点云。初始化主要用到的函数，是gaussian_model中的create_from_pcd函数：

def create_from_pcd(self, pcd : BasicPointCloud, spatial_lr_scale : float):

        # 所有相机的中心点位置到最远camera的距离 * 1.1
        # 根据大量的3DGS解读，应该与学习率有关
        # 防止固定的学习率适配不同尺度的场景时出现问题。
        self.spatial_lr_scale = spatial_lr_scale
        # 点云转tensor送入GPU，实际上就是稀疏点云的3D坐标
        # (N, 3) 这里输出的是所有点云
        fused_point_cloud = torch.tensor(np.asarray(pcd.points)).float().cuda()
        # RGB转球谐函数送入GPU
        # 也是（N,3），应为球谐的直流分量
        # RGB2SH(x) = (x - 0.5) / 0.28209479177387814
        # 0.28209479177387814是1 / (2*sqrt(pi))，是直流分量Y(l=0,m=0)的值
        fused_color = RGB2SH(torch.tensor(np.asarray(pcd.colors)).float().cuda())



        #  RGB三通道球谐的所有系数，大小为(N, 3, (最大球谐阶数 + 1)²), 这部分 3DGS与 2DGS毫无区别
        features = torch.zeros((fused_color.shape[0], 3, (self.max_sh_degree + 1) ** 2)).float().cuda()
        features[:, :3, 0 ] = fused_color
        features[:, 3:, 1:] = 0.0
        print("Number of points at initialisation : ", fused_point_cloud.shape[0])







        # 调用simple_knn的distCUDA2函数，计算点云中的每个点到与其最近的K个点的平均距离的平方
        # dist2的大小应该是(N,)。
        # 首先可以明确的是这句话用来初始化scale，且scale（的平方）不能低于1e-7。
        # 我阅读了一下submodules/simple-knn/simple_knn.cu，大致猜出来了这个是什么意思。
        # distCUDA2函数由simple_knn.cu的SimpleKNN::knn函数实现。
        # KNN意思是K-Nearest Neighbor，即求每一点最近的K个点。
        # simple_knn.cu中令k=3，求得每一点最近的三个点距该点的平均距离。
        # 原理是把3D空间中的每个点用莫顿编码（Morton Encoding）转化为一个1D坐标
        # 用到了能够填满空间的Z曲线
        # 然后对1D坐标进行排序，从而确定离每个点最近的三个点。
        # simple_knn.cu实际上还用了一种加速策略，是将点集分为多个大小为1024的块（box），
        # 在每个块内确定3个最近邻居和它们的平均距离。用平均距离作为Gaussian的scale。
        dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
        #---------------------------------------------------------------------------------------------------------------
        # 因为2DGS中只有2个缩放值
        # 因为scale的激活函数是exp，所以这里存的也不是真的scale，而是ln(scale)。
  # 注意dist2其实是距离的平方，所以这里要开根号。
  # repeat(1, 2)标明两个方向上scale的初始值是相等的。
  # scales的大小：(N, 2) 这是与3DGS完全不同的
        scales = torch.log(torch.sqrt(dist2))[...,None].repeat(1, 2)
        #---------------------------------------------------------------------------------------------------------------

        # 2DGS不是，2DGS使用[0,1]的均匀分布进行初始化
        # 这里与3DGS有明显区别
        rots = torch.rand((fused_point_cloud.shape[0], 4), device="cuda")

        # 完全不同，这里使用：torch.log(x/(1-x))，而不是sigmoid。
        # 因为输入时，透明度是（N, 1）,这里统一后的初始值为-2.1972
        # 原因不明,但这里最终的值，与3DGS保持一致（-2.197）
        opacities = self.inverse_opacity_activation(0.1 * torch.ones((fused_point_cloud.shape[0], 1), dtype=torch.float, device="cuda"))

        # 初始化位置，sh系数（直接+剩余），缩放（3个轴），旋转（四元数），不透明度（逆sigmoid的值）
        # ---------------------------------------------------------------------------------------------------------------
        # 一些函数的解释：requires_grad=True 的作用是让 backward 可以追踪这个参数并且计算它的梯度。
        # 最开始定义你的输入是 requires_grad=True ，那么后续对应的输出也自动具有 requires_grad=True ，如代码中无关联的数值 x ，其 requires_grad 仍等于 False。
        # ---------------------------------------------------------------------------------------------------------------
        # 这里就是直接初始化每个单独的点云。
        # ---------------------------------------------------------------------------------------------------------------
        # 一些函数的解释：nn.Parameter():
        # 将一个不可训练的tensor转换成可以训练的类型parameter，并将这个parameter绑定到这个module里面。
        # 即在定义网络时这个tensor就是一个可以训练的参数了。使用这个函数的目的也是想让某些变量在学习的过程中不断的修改其值以达到最优化。
        # 输出点云坐标（N,3）
        self._xyz = nn.Parameter(fused_point_cloud.requires_grad_(True))
        # RGB三个通道的直流分量，(N, 1, 3)， 3DGS是（N, 3, 1）
        self._features_dc = nn.Parameter(features[:,:,0:1].transpose(1, 2).contiguous().requires_grad_(True))
        #  RGB三个通道的高阶分量，(N, (最大球谐阶数 + 1)² - 1, 3), 这里不太一样，与上面相同后两位换了位置
        self._features_rest = nn.Parameter(features[:,:,1:].transpose(1, 2).contiguous().requires_grad_(True))
        # 这里的缩放尺度，为（N,2）
        self._scaling = nn.Parameter(scales.requires_grad_(True))
        # 旋转4原数 （N,4）
        self._rotation = nn.Parameter(rots.requires_grad_(True))
        # (N.1)透明度
        self._opacity = nn.Parameter(opacities.requires_grad_(True))
        # 投影到2D时, 每个2D gaussian最大的半径，这里初始为（N，）
        self.max_radii2D = torch.zeros((self.get_xyz.shape[0]), device="cuda")

我这里解释得非常详细哈。大家可以仔细看看。除此之外，最重要的，就是train环节中得填充致密化环节。

# 在一定的迭代次数内进行密集化处理
            if iteration < opt.densify_until_iter:
                # 将每个像素位置上的最大半径记录在 max_radii2D 中。这是为了密集化时进行修剪（pruning）操作时的参考。
                gaussians.max_radii2D[visibility_filter] = torch.max(gaussians.max_radii2D[visibility_filter], radii[visibility_filter])
                # 将与密集化相关的统计信息添加到 gaussians 模型中，包括视图空间点和可见性过滤器
                gaussians.add_densification_stats(viewspace_point_tensor, visibility_filter)
                # 在指定的迭代次数之后，每隔一定的迭代间隔进行以下密集化操作
                # 大于500次的时候，并且除以100余0
                if iteration > opt.densify_from_iter and iteration % opt.densification_interval == 0:
                    # 据当前迭代次数设置密集化的阈值。如果当前迭代次数大于 opt.opacity_reset_interval，则设置 size_threshold 为 20，否则为 None
                    size_threshold = 20 if iteration > opt.opacity_reset_interval else None
                    # 执行密集化和修剪操作，其中包括梯度阈值、密集化阈值、相机范围和之前计算的 size_threshold。
                    gaussians.densify_and_prune(opt.densify_grad_threshold, opt.opacity_cull, scene.cameras_extent, size_threshold)
                # 在每隔一定迭代次数或在白色背景数据集上的指定迭代次数时，执行以下操作。
                if iteration % opt.opacity_reset_interval == 0 or (dataset.white_background and iteration == opt.densify_from_iter):
                    # 重置模型中的某些参数，涉及到透明度的操作，具体实现可以在 reset_opacity 方法中找到。
                    gaussians.reset_opacity()

            # 执行优化器的步骤，然后清零梯度。
            if iteration < opt.iterations:
                gaussians.optimizer.step()
                gaussians.optimizer.zero_grad(set_to_none = True)

            # 如果达到检查点迭代次数，保存检查点
            if (iteration in checkpoint_iterations):
                print("\n[ITER {}] Saving Checkpoint".format(iteration))
                torch.save((gaussians.capture(), iteration), scene.model_path + "/chkpnt" + str(iteration) + ".pth")

这部分，主要调用的是gaussian_model中的densify_and_split、densify_and_prune、densify_and_clone函数：

def densify_and_split(self, grads, grad_threshold, scene_extent, N=2):
        # 获取初始点的数量。
        n_init_points = self.get_xyz.shape[0]
        # 创建一个长度为初始点数量的梯度张量，并将计算得到的梯度填充到其中。
        padded_grad = torch.zeros((n_init_points), device="cuda")
        padded_grad[:grads.shape[0]] = grads.squeeze()
        # 创建一个掩码，标记那些梯度大于等于指定阈值的点。
        selected_pts_mask = torch.where(padded_grad >= grad_threshold, True, False)
        # 一步过滤掉那些缩放（scaling）大于一定百分比的场景范围的点
        # 这里是一个高斯分裂的过程：被分裂的Gaussians满足两个条件：
        #   1. （平均）梯度过大；
        #   2. 在某个方向的最大缩放大于一个阈值。
        #   参照论文5.2节“On the other hand...”一段，大Gaussian被分裂成两个小Gaussians，
        #   其放缩被除以φ=1.6，且位置是以原先的大Gaussian作为概率密度函数进行采样的。
        selected_pts_mask = torch.logical_and(selected_pts_mask,
                                              torch.max(self.get_scaling, dim=1).values > self.percent_dense*scene_extent)


        # 为每个点生成新的样本，其中 stds 是点的缩放，means 是均值， 第一步是一样的
        # 这里从新的点云中更新缩放因子，并且进行同样的复制一份
        stds = self.get_scaling[selected_pts_mask].repeat(N,1)
        # 这里我大致明白，本身获取的是（su, sv）,为了与旋转矩阵相对应，构建（su,sv,0）
        stds = torch.cat([stds, 0 * torch.ones_like(stds[:,:1])], dim=-1)
        # 这里就是一个同样大小的矩阵
        means = torch.zeros_like(stds)
        # 使用均值和标准差生成样本
        samples = torch.normal(mean=means, std=stds)
        # 为每个点构建旋转矩阵，并将其重复 N 次。
        rots = build_rotation(self._rotation[selected_pts_mask]).repeat(N,1,1)
        # 将旋转后的样本点添加到原始点的位置。
        new_xyz = torch.bmm(rots, samples.unsqueeze(-1)).squeeze(-1) + self.get_xyz[selected_pts_mask].repeat(N, 1)
        # 生成新的缩放参数。
        new_scaling = self.scaling_inverse_activation(self.get_scaling[selected_pts_mask].repeat(N,1) / (0.8*N))
        # 将旋转、原始点特征、等等重复N次
        new_rotation = self._rotation[selected_pts_mask].repeat(N,1)
        new_features_dc = self._features_dc[selected_pts_mask].repeat(N,1,1)
        new_features_rest = self._features_rest[selected_pts_mask].repeat(N,1,1)
        new_opacity = self._opacity[selected_pts_mask].repeat(N,1)

        # 调用另一个方法 densification_postfix，该方法对新生成的点执行后处理操作（此处跟densify_and_clone一样）。
        self.densification_postfix(new_xyz, new_features_dc, new_features_rest, new_opacity, new_scaling, new_rotation)
        # 创建一个修剪（pruning）的过滤器，将新生成的点添加到原始点的掩码之后。
        prune_filter = torch.cat((selected_pts_mask, torch.zeros(N * selected_pts_mask.sum(), device="cuda", dtype=bool)))
        # 根据修剪过滤器，修剪模型中的一些参数。
        self.prune_points(prune_filter)

    def densify_and_clone(self, grads, grad_threshold, scene_extent):
        # 建一个掩码，标记满足梯度条件的点。具体来说，对于每个点，计算其梯度的L2范数，如果大于等于指定的梯度阈值，则标记为True，否则标记为False。
        # 提取出大于阈值`grad_threshold`且缩放参数较小（小于self.percent_dense * scene_extent）的Gaussians，在下面进行克隆

        selected_pts_mask = torch.where(torch.norm(grads, dim=-1) >= grad_threshold, True, False)
        selected_pts_mask = torch.logical_and(selected_pts_mask,
                                              torch.max(self.get_scaling, dim=1).values <= self.percent_dense*scene_extent)
        # 在上述掩码的基础上，进一步过滤掉那些缩放（scaling）大于一定百分比（self.percent_dense）的场景范围（scene_extent）的点。
        # 这样可以确保新添加的点不会太远离原始数据。
        # 根据掩码选取符合条件的点的其他特征，如颜色、透明度、缩放和旋转等。

        new_xyz = self._xyz[selected_pts_mask]
        new_features_dc = self._features_dc[selected_pts_mask]
        new_features_rest = self._features_rest[selected_pts_mask]
        new_opacities = self._opacity[selected_pts_mask]
        new_scaling = self._scaling[selected_pts_mask]
        new_rotation = self._rotation[selected_pts_mask]

        self.densification_postfix(new_xyz, new_features_dc, new_features_rest, new_opacities, new_scaling, new_rotation)


    # 执行密集化和修剪操作
    def densify_and_prune(self, max_grad, min_opacity, extent, max_screen_size):

        # 计算密度估计的梯度
        grads = self.xyz_gradient_accum / self.denom
        # 将梯度中的 NaN（非数值）值设置为零，以处理可能的数值不稳定性
        grads[grads.isnan()] = 0.0



        # 对under reconstruction的区域进行复制操作，为了稠密化
        self.densify_and_clone(grads, max_grad, extent)
        # 对over reconstruction的区域进行分裂操作，为了稠密化
        self.densify_and_split(grads, max_grad, extent)


        # 接下来移除一些Gaussians，它们满足下列要求中的一个：
        # 1. 接近透明（不透明度小于min_opacity）
        # 2. 在某个相机视野里出现过的最大2D半径大于屏幕（像平面）大小
        # 3. 在某个方向的最大缩放大于0.1 * extent（也就是说很长的长条形也是会被移除的）
        # 创建一个掩码，标记那些透明度小于指定阈值的点。.squeeze() 用于去除掩码中的单维度。
        prune_mask = (self.get_opacity < min_opacity).squeeze()
        # 设置相机的范围
        if max_screen_size:
            # 创建一个掩码，标记在图像空间中半径大于指定阈值的点。
            big_points_vs = self.max_radii2D > max_screen_size
            # 创建一个掩码，标记在世界空间中尺寸大于指定阈值的点。
            big_points_ws = self.get_scaling.max(dim=1).values > 0.1 * extent
            # 将这两个掩码与先前的透明度掩码进行逻辑或操作，得到最终的修剪掩码。
            prune_mask = torch.logical_or(torch.logical_or(prune_mask, big_points_vs), big_points_ws)
        # 根据修剪掩码，修剪模型中的一些参数
        self.prune_points(prune_mask)
        # 清理 GPU 缓存，释放一些内存
        torch.cuda.empty_cache()


    # 统计坐标的累积梯度和均值的分母（即迭代步数）
    def add_densification_stats(self, viewspace_point_tensor, update_filter):
        self.xyz_gradient_accum[update_filter] += torch.norm(viewspace_point_tensor.grad[update_filter], dim=-1, keepdim=True)
        self.denom[update_filter] += 1

上述代码包含了密集化、修建、分裂等步骤。除此之外，我认为最重要的，就是Splatting的渲染过程。这部分在submodules/diff-surfel-rasterization/cuda_rasterizer/forward.cu中。2DGS在这部分，与3DGS有着明显的不同，因此也可以细细品味这部分：

#include "forward.h"
#include "auxiliary.h"
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// 该代码实现了一个名为 `computeColorFromSH` 的函数，它用于计算基于 SH (Surface Hemi-Sphere) 光线着色模型的颜色
__device__ glm::vec3 computeColorFromSH(int idx, int deg, int max_coeffs, const glm::vec3* means, glm::vec3 campos, const float* shs, bool* clamped)
{

/*
* `idx`：索引，表示在数组中的索引。
* `deg`：光线着色模型的阶数。
* `max_coeffs`：SH 函数的系数数量。
* `means`：包含模型中每个点的平均颜色。
* `campos`：观察者的位置。
* `shs`：包含 SH 函数系数的指针。
* `clamped`：一个 bool 数组，用于记录是否颜色被限制
*/


/*
1. **计算点位置和方向向量：**计算点与观察者的距离和方向向量，并标准化方向向量。
2. **获取 SH 函数系数：**根据索引从 `shs` 中获取相应的 SH 函数系数。
3. **计算颜色：**根据光线着色模型的阶数，逐步计算颜色。
4. **限制颜色：**如果颜色值小于 0，则将其限制为 0，并记录是否颜色被限制。
5. **返回颜色：**返回计算出的颜色，以及是否颜色被限制。
*/
 glm::vec3 pos = means[idx];
 glm::vec3 dir = pos - campos;
 dir = dir / glm::length(dir);

 glm::vec3* sh = ((glm::vec3*)shs) + idx * max_coeffs;
 glm::vec3 result = SH_C0 * sh[0];

 if (deg > 0)
 {
  float x = dir.x;
  float y = dir.y;
  float z = dir.z;
  result = result - SH_C1 * y * sh[1] + SH_C1 * z * sh[2] - SH_C1 * x * sh[3];

  if (deg > 1)
  {
   float xx = x * x, yy = y * y, zz = z * z;
   float xy = x * y, yz = y * z, xz = x * z;
   result = result +
    SH_C2[0] * xy * sh[4] +
    SH_C2[1] * yz * sh[5] +
    SH_C2[2] * (2.0f * zz - xx - yy) * sh[6] +
    SH_C2[3] * xz * sh[7] +
    SH_C2[4] * (xx - yy) * sh[8];

   if (deg > 2)
   {
    result = result +
     SH_C3[0] * y * (3.0f * xx - yy) * sh[9] +
     SH_C3[1] * xy * z * sh[10] +
     SH_C3[2] * y * (4.0f * zz - xx - yy) * sh[11] +
     SH_C3[3] * z * (2.0f * zz - 3.0f * xx - 3.0f * yy) * sh[12] +
     SH_C3[4] * x * (4.0f * zz - xx - yy) * sh[13] +
     SH_C3[5] * z * (xx - yy) * sh[14] +
     SH_C3[6] * x * (xx - 3.0f * yy) * sh[15];
   }
  }
 }
 result += 0.5f;

 // RGB colors are clamped to positive values. If values are
 // clamped, we need to keep track of this for the backward pass.
 clamped[3 * idx + 0] = (result.x < 0);
 clamped[3 * idx + 1] = (result.y < 0);
 clamped[3 * idx + 2] = (result.z < 0);
 return glm::max(result, 0.0f);
}








// 计算仿射变换齐次矩阵------------------------------------------------------------------------------------------------------------
// Compute a 2D-to-2D mapping matrix from a tangent plane into a image plane
// given a 2D gaussian parameters.
// glm::mat 生成矩阵 后面+数字表示维度
// glm::vec 生成向量 后面+数字表示维度
__device__ void compute_transmat(


 const float3& p_orig,
 const glm::vec2 scale,
 float mod,
 const glm::vec4 rot,
 const float* projmatrix,
 const float* viewmatrix,
 const int W,
 const int H, 
 glm::mat3 &T,
 float3 &normal
) {
    // rot转换为旋转矩阵R
 glm::mat3 R = quat_to_rotmat(rot);
 // 函数将缩放向量 `scale` 和缩放因子 `mod` 转换为缩放矩阵 `S`
 glm::mat3 S = scale_to_mat(scale, mod);
 // 计算变换矩阵 `L`
 glm::mat3 L = R * S;

 // 计算世界坐标到 NDC 的矩阵
 // splat2world矩阵 将 Gaussians 的中心从局部坐标转换为相机坐标。
 glm::mat3x4 splat2world = glm::mat3x4(
  glm::vec4(L[0], 0.0),
  glm::vec4(L[1], 0.0),
  glm::vec4(p_orig.x, p_orig.y, p_orig.z, 1)
 );
    // world2ndc矩阵 将世界坐标转换为 NDC坐标
 glm::mat4 world2ndc = glm::mat4(
  projmatrix[0], projmatrix[4], projmatrix[8], projmatrix[12],
  projmatrix[1], projmatrix[5], projmatrix[9], projmatrix[13],
  projmatrix[2], projmatrix[6], projmatrix[10], projmatrix[14],
  projmatrix[3], projmatrix[7], projmatrix[11], projmatrix[15]
 );
    // 矩阵将 NDC坐标转换为像素坐标
 glm::mat3x4 ndc2pix = glm::mat3x4(
  glm::vec4(float(W) / 2.0, 0.0, 0.0, float(W-1) / 2.0),
  glm::vec4(0.0, float(H) / 2.0, 0.0, float(H-1) / 2.0),
  glm::vec4(0.0, 0.0, 0.0, 1.0)
 );
    // 计算传输矩阵 T
 T = glm::transpose(splat2world) * world2ndc * ndc2pix;
 // 计算法线 normal
 normal = transformVec4x3({L[2].x, L[2].y, L[2].z}, viewmatrix);

}


// --------------------------------------------------------------------------------------------------------------------
// 该代码计算 2D 高斯分布的边界框和中心，并将中心用于创建低通滤波器
// --------------------------------------------------------------------------------------------------------------------
/*
`T`：3x3 矩阵，用于转换图像坐标到特征空间。上一个函数已经计算了
* `cutoff`： 高斯分布的截止频率。
* `point_image`： 指标图像上的点坐标。
* `extent`：bounding box 的扩展。
*/
__device__ bool compute_aabb(
 glm::mat3 T, 
 float cutoff, 
 float2& point_image,
 float2 & extent
) {
    // 计算 T 矩阵的逆向，并将 T0、T1 和 T3 值计算
    // T0、T1 和 T3 值是 T 矩阵的第 0、第 1 和第 2 行，分别对应特征空间坐标的三个维度
 float3 T0 = {T[0][0], T[0][1], T[0][2]};
 float3 T1 = {T[1][0], T[1][1], T[1][2]};
 float3 T3 = {T[2][0], T[2][1], T[2][2]};

 // 计算 AABB
 // 计算 temp_point 值
 // temp_point 值为 (cutoff * cutoff, cutoff * cutoff, -1.0f)，其中cutoff 是高斯分布的截止频率
 float3 temp_point = {cutoff * cutoff, cutoff * cutoff, -1.0f};
 // 计算 distance 和 f 值
 // distance 是从 T3 到 temp_point 的距离。
 // f 值是 distance 的倒数，乘以 temp_point。
 float distance = sumf3(T3 * T3 * temp_point);
 float3 f = (1 / distance) * temp_point;

 if (distance == 0.0) return false;

    // `sumf3` 和 `maxf2` 等函数用于计算向量和矩阵的和、最大值等操作。
 point_image = {
  sumf3(f * T0 * T3),
  sumf3(f * T1 * T3)
 };  
 
 float2 temp = {
  sumf3(f * T0 * T0),
  sumf3(f * T1 * T1)
 };
 float2 half_extend = point_image * point_image - temp;
 extent = sqrtf2(maxf2(1e-4, half_extend));
 return true;
}

// 名为 `preprocessCUDA`，它负责每个 Gaussian 前处理，并将数据转换为 raster 化
template<int C>
__global__ void preprocessCUDA(int P, int D, int M,
 const float* orig_points,
 const glm::vec2* scales,
 const float scale_modifier,
 const glm::vec4* rotations,
 const float* opacities,
 const float* shs,
 bool* clamped,
 const float* transMat_precomp,
 const float* colors_precomp,
 const float* viewmatrix,
 const float* projmatrix,
 const glm::vec3* cam_pos,
 const int W, int H,
 const float tan_fovx, const float tan_fovy,
 const float focal_x, const float focal_y,
 int* radii,
 float2* points_xy_image,
 float* depths,
 float* transMats,
 float* rgb,
 float4* normal_opacity,
 const dim3 grid,
 uint32_t* tiles_touched,
 bool prefiltered)
{
 auto idx = cg::this_grid().thread_rank();
 if (idx >= P)
  return;

 // Initialize radius and touched tiles to 0. If this isn't changed,
 // this Gaussian will not be processed further.
 radii[idx] = 0;
 tiles_touched[idx] = 0;

 // Perform near culling, quit if outside.
 float3 p_view;
 if (!in_frustum(idx, orig_points, viewmatrix, projmatrix, prefiltered, p_view))
  return;
 
 // Compute transformation matrix
 glm::mat3 T;
 float3 normal;
 if (transMat_precomp == nullptr)
 {
  compute_transmat(((float3*)orig_points)[idx], scales[idx], scale_modifier, rotations[idx], projmatrix, viewmatrix, W, H, T, normal);
  float3 *T_ptr = (float3*)transMats;
  T_ptr[idx * 3 + 0] = {T[0][0], T[0][1], T[0][2]};
  T_ptr[idx * 3 + 1] = {T[1][0], T[1][1], T[1][2]};
  T_ptr[idx * 3 + 2] = {T[2][0], T[2][1], T[2][2]};
 } else {
  glm::vec3 *T_ptr = (glm::vec3*)transMat_precomp;
  T = glm::mat3(
   T_ptr[idx * 3 + 0], 
   T_ptr[idx * 3 + 1],
   T_ptr[idx * 3 + 2]
  );
  normal = make_float3(0.0, 0.0, 1.0);
 }

#if DUAL_VISIABLE
 float cos = -sumf3(p_view * normal);
 if (cos == 0) return;
 float multiplier = cos > 0 ? 1: -1;
 normal = multiplier * normal;
#endif

#if TIGHTBBOX // no use in the paper, but it indeed help speeds.
 // the effective extent is now depended on the opacity of gaussian.
 float cutoff = sqrtf(max(9.f + 2.f * logf(opacities[idx]), 0.000001));
#else
 float cutoff = 3.0f;
#endif

 // Compute center and radius
 float2 point_image;
 float radius;
 {
  float2 extent;
  bool ok = compute_aabb(T, cutoff, point_image, extent);
  if (!ok) return;
  radius = ceil(max(extent.x, extent.y));
 }

 uint2 rect_min, rect_max;
 getRect(point_image, radius, rect_min, rect_max, grid);
 if ((rect_max.x - rect_min.x) * (rect_max.y - rect_min.y) == 0)
  return;

 // Compute colors 
 if (colors_precomp == nullptr) {
  glm::vec3 result = computeColorFromSH(idx, D, M, (glm::vec3*)orig_points, *cam_pos, shs, clamped);
  rgb[idx * C + 0] = result.x;
  rgb[idx * C + 1] = result.y;
  rgb[idx * C + 2] = result.z;
 }

 depths[idx] = p_view.z;
 radii[idx] = (int)radius;
 points_xy_image[idx] = point_image;
 normal_opacity[idx] = {normal.x, normal.y, normal.z, opacities[idx]};
 tiles_touched[idx] = (rect_max.y - rect_min.y) * (rect_max.x - rect_min.x);
}

// Main rasterization method. Collaboratively works on one tile per
// block, each thread treats one pixel. Alternates between fetching 
// and rasterizing data.
template <uint32_t CHANNELS>
__global__ void __launch_bounds__(BLOCK_X * BLOCK_Y)
renderCUDA(
 const uint2* __restrict__ ranges,
 const uint32_t* __restrict__ point_list,
 int W, int H,
 float focal_x, float focal_y,
 const float2* __restrict__ points_xy_image,
 const float* __restrict__ features,
 const float* __restrict__ transMats,
 const float* __restrict__ depths,
 const float4* __restrict__ normal_opacity,
 float* __restrict__ final_T,
 uint32_t* __restrict__ n_contrib,
 const float* __restrict__ bg_color,
 float* __restrict__ out_color,
 float* __restrict__ out_others)
{
 // Identify current tile and associated min/max pixel range.
 auto block = cg::this_thread_block();
 uint32_t horizontal_blocks = (W + BLOCK_X - 1) / BLOCK_X;
 uint2 pix_min = { block.group_index().x * BLOCK_X, block.group_index().y * BLOCK_Y };
 uint2 pix_max = { min(pix_min.x + BLOCK_X, W), min(pix_min.y + BLOCK_Y , H) };
 uint2 pix = { pix_min.x + block.thread_index().x, pix_min.y + block.thread_index().y };
 uint32_t pix_id = W * pix.y + pix.x;
 float2 pixf = { (float)pix.x, (float)pix.y};

 // Check if this thread is associated with a valid pixel or outside.
 bool inside = pix.x < W&& pix.y < H;
 // Done threads can help with fetching, but don't rasterize
 bool done = !inside;

 // Load start/end range of IDs to process in bit sorted list.
 uint2 range = ranges[block.group_index().y * horizontal_blocks + block.group_index().x];
 const int rounds = ((range.y - range.x + BLOCK_SIZE - 1) / BLOCK_SIZE);
 int toDo = range.y - range.x;

 // Allocate storage for batches of collectively fetched data.
 __shared__ int collected_id[BLOCK_SIZE];
 __shared__ float2 collected_xy[BLOCK_SIZE];
 __shared__ float4 collected_normal_opacity[BLOCK_SIZE];
 __shared__ float3 collected_Tu[BLOCK_SIZE];
 __shared__ float3 collected_Tv[BLOCK_SIZE];
 __shared__ float3 collected_Tw[BLOCK_SIZE];

 // Initialize helper variables
 float T = 1.0f;
 uint32_t contributor = 0;
 uint32_t last_contributor = 0;
 float C[CHANNELS] = { 0 };


#if RENDER_AXUTILITY
 // render axutility ouput
 float N[3] = {0};
 float D = { 0 };
 float M1 = {0};
 float M2 = {0};
 float distortion = {0};
 float median_depth = {0};
 // float median_weight = {0};
 float median_contributor = {-1};

#endif

 // Iterate over batches until all done or range is complete
 for (int i = 0; i < rounds; i++, toDo -= BLOCK_SIZE)
 {
  // End if entire block votes that it is done rasterizing
  int num_done = __syncthreads_count(done);
  if (num_done == BLOCK_SIZE)
   break;

  // Collectively fetch per-Gaussian data from global to shared
  int progress = i * BLOCK_SIZE + block.thread_rank();
  if (range.x + progress < range.y)
  {
   int coll_id = point_list[range.x + progress];
   collected_id[block.thread_rank()] = coll_id;
   collected_xy[block.thread_rank()] = points_xy_image[coll_id];
   collected_normal_opacity[block.thread_rank()] = normal_opacity[coll_id];
   collected_Tu[block.thread_rank()] = {transMats[9 * coll_id+0], transMats[9 * coll_id+1], transMats[9 * coll_id+2]};
   collected_Tv[block.thread_rank()] = {transMats[9 * coll_id+3], transMats[9 * coll_id+4], transMats[9 * coll_id+5]};
   collected_Tw[block.thread_rank()] = {transMats[9 * coll_id+6], transMats[9 * coll_id+7], transMats[9 * coll_id+8]};
  }
  block.sync();

  // Iterate over current batch
  for (int j = 0; !done && j < min(BLOCK_SIZE, toDo); j++)
  {
   // Keep track of current position in range
   contributor++;

   // Fisrt compute two homogeneous planes, See Eq. (8)
   const float2 xy = collected_xy[j];
   const float3 Tu = collected_Tu[j];
   const float3 Tv = collected_Tv[j];
   const float3 Tw = collected_Tw[j];
   float3 k = pix.x * Tw - Tu;
   float3 l = pix.y * Tw - Tv;
   float3 p = cross(k, l);
   if (p.z == 0.0) continue;
   float2 s = {p.x / p.z, p.y / p.z};
   float rho3d = (s.x * s.x + s.y * s.y); 
   float2 d = {xy.x - pixf.x, xy.y - pixf.y};
   float rho2d = FilterInvSquare * (d.x * d.x + d.y * d.y); 

   // compute intersection and depth
   float rho = min(rho3d, rho2d);
   float depth = (rho3d <= rho2d) ? (s.x * Tw.x + s.y * Tw.y) + Tw.z : Tw.z; 
   if (depth < near_n) continue;
   float4 nor_o = collected_normal_opacity[j];
   float normal[3] = {nor_o.x, nor_o.y, nor_o.z};
   float opa = nor_o.w;

   float power = -0.5f * rho;
   if (power > 0.0f)
    continue;

   // Eq. (2) from 3D Gaussian splatting paper.
   // Obtain alpha by multiplying with Gaussian opacity
   // and its exponential falloff from mean.
   // Avoid numerical instabilities (see paper appendix). 
   float alpha = min(0.99f, opa * exp(power));
   if (alpha < 1.0f / 255.0f)
    continue;
   float test_T = T * (1 - alpha);
   if (test_T < 0.0001f)
   {
    done = true;
    continue;
   }

   float w = alpha * T;
#if RENDER_AXUTILITY
   // Render depth distortion map
   // Efficient implementation of distortion loss, see 2DGS' paper appendix.
   float A = 1-T;
   float m = far_n / (far_n - near_n) * (1 - near_n / depth);
   distortion += (m * m * A + M2 - 2 * m * M1) * w;
   D  += depth * w;
   M1 += m * w;
   M2 += m * m * w;

   if (T > 0.5) {
    median_depth = depth;
    // median_weight = w;
    median_contributor = contributor;
   }
   // Render normal map
   for (int ch=0; ch<3; ch++) N[ch] += normal[ch] * w;
#endif

   // Eq. (3) from 3D Gaussian splatting paper.
   for (int ch = 0; ch < CHANNELS; ch++)
    C[ch] += features[collected_id[j] * CHANNELS + ch] * w;
   T = test_T;

   // Keep track of last range entry to update this
   // pixel.
   last_contributor = contributor;
  }
 }

 // All threads that treat valid pixel write out their final
 // rendering data to the frame and auxiliary buffers.
 if (inside)
 {
  final_T[pix_id] = T;
  n_contrib[pix_id] = last_contributor;
  for (int ch = 0; ch < CHANNELS; ch++)
   out_color[ch * H * W + pix_id] = C[ch] + T * bg_color[ch];

#if RENDER_AXUTILITY
  n_contrib[pix_id + H * W] = median_contributor;
  final_T[pix_id + H * W] = M1;
  final_T[pix_id + 2 * H * W] = M2;
  out_others[pix_id + DEPTH_OFFSET * H * W] = D;
  out_others[pix_id + ALPHA_OFFSET * H * W] = 1 - T;
  for (int ch=0; ch<3; ch++) out_others[pix_id + (NORMAL_OFFSET+ch) * H * W] = N[ch];
  out_others[pix_id + MIDDEPTH_OFFSET * H * W] = median_depth;
  out_others[pix_id + DISTORTION_OFFSET * H * W] = distortion;
  // out_others[pix_id + MEDIAN_WEIGHT_OFFSET * H * W] = median_weight;
#endif
 }
}

void FORWARD::render(
 const dim3 grid, dim3 block,
 const uint2* ranges,
 const uint32_t* point_list,
 int W, int H,
 float focal_x, float focal_y,
 const float2* means2D,
 const float* colors,
 const float* transMats,
 const float* depths,
 const float4* normal_opacity,
 float* final_T,
 uint32_t* n_contrib,
 const float* bg_color,
 float* out_color,
 float* out_others)
{
 renderCUDA<NUM_CHANNELS> << <grid, block >> > (
  ranges,
  point_list,
  W, H,
  focal_x, focal_y,
  means2D,
  colors,
  transMats,
  depths,
  normal_opacity,
  final_T,
  n_contrib,
  bg_color,
  out_color,
  out_others);
}

void FORWARD::preprocess(int P, int D, int M,
 const float* means3D,
 const glm::vec2* scales,
 const float scale_modifier,
 const glm::vec4* rotations,
 const float* opacities,
 const float* shs,
 bool* clamped,
 const float* transMat_precomp,
 const float* colors_precomp,
 const float* viewmatrix,
 const float* projmatrix,
 const glm::vec3* cam_pos,
 const int W, const int H,
 const float focal_x, const float focal_y,
 const float tan_fovx, const float tan_fovy,
 int* radii,
 float2* means2D,
 float* depths,
 float* transMats,
 float* rgb,
 float4* normal_opacity,
 const dim3 grid,
 uint32_t* tiles_touched,
 bool prefiltered)
{
 preprocessCUDA<NUM_CHANNELS> << <(P + 255) / 256, 256 >> > (
  P, D, M,
  means3D,
  scales,
  scale_modifier,
  rotations,
  opacities,
  shs,
  clamped,
  transMat_precomp,
  colors_precomp,
  viewmatrix, 
  projmatrix,
  cam_pos,
  W, H,
  tan_fovx, tan_fovy,
  focal_x, focal_y,
  radii,
  means2D,
  depths,
  transMats,
  rgb,
  normal_opacity,
  grid,
  tiles_touched,
  prefiltered
  );
}

代码+注释下载

所有的注释，我基本上对应了原文中，进行了标注。我打包后已经上传到CSDN平台和百度云平台，供大家下载查看。后续我会更新一下VastGaussian、GaussianPRO以及4DGS等环节，欢迎大家关注我。

请注意，我上传的版本，是带有查看器的版本。其中，diff-surfel-rasterization和simple-knn我都是安装好了的。所有运行方式请查看2DGS的Github代码主页。

CSDN版本（不带数据集）：

404
download.csdn.net/download/m0_57628341/89537380

百度云盘链接（带DTU数据集）：

百度云下载地址
pan.baidu.com/s/11MDbtQ9KKAm0Pr1GE9Mtug?pwd=lazp
密码：lazp

投稿作者为『自动驾驶之心知识星球』特邀嘉宾，欢迎加入交流！重磅，自动驾驶之心科研论文辅导来啦，申博、CCF系列、SCI、EI、毕业论文、比赛辅导等多个方向，欢迎联系我们！

① 全网独家视频课程

BEV感知、BEV模型部署、BEV目标跟踪、毫米波雷达视觉融合、多传感器标定、多传感器融合、多模态3D目标检测、车道线检测、轨迹预测、在线高精地图、世界模型、点云3D目标检测、目标跟踪、Occupancy、cuda与TensorRT模型部署、大模型与自动驾驶、Nerf、语义分割、自动驾驶仿真、传感器部署、决策规划、轨迹预测等多个方向学习视频（扫码即可学习）

网页端官网：www.zdjszx.com

② 国内首个自动驾驶学习社区

国内最大最专业，近3000人的交流社区，已得到大多数自动驾驶公司的认可！涉及30+自动驾驶技术栈学习路线，从0到一带你入门自动驾驶感知（2D/3D检测、语义分割、车道线、BEV感知、Occupancy、多传感器融合、多传感器标定、目标跟踪）、自动驾驶定位建图（SLAM、高精地图、局部在线地图）、自动驾驶规划控制/轨迹预测等领域技术方案、大模型、端到端等，更有行业动态和岗位发布！欢迎扫描下方二维码，加入自动驾驶之心知识星球，这是一个真正有干货的地方，与领域大佬交流入门、学习、工作、跳槽上的各类难题，日常分享论文+代码+视频