Beginner's Research Notes: Dissecting Voxel Generation, the Anchor Mechanism, and Sparse-Convolution Feature Transformation in SA-SSD

Article link: https://blog.csdn.net/qq_39732684/article/details/105188258

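1. Introduction

For the 3D object detector SA-SSD, since I am still a beginner, there are $N$ details at the code level that I do not fully understand. Specifically, this post works through the following questions:

How are voxels generated, and what is their data format?
How are anchors generated, and what is their data format? And what is the Anchor Mask?
Where are the anchors and the Anchor Mask used?
What role do anchors play in an anchor-based detector?
How are the 3D sparse-convolution features turned into BEV features?

After some effort I have figured out these five details, so $N - 5$ details remain 😃 and are left for the next post.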

2. Understanding Voxel Generation in SA-SSD

The hyperparameters for voxel generation are in car_cfg.py.

        generator=dict(
            type='VoxelGenerator',
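            voxel_size=[0.05, 0.05, 0.1], # voxel size: 0.05 m in x and y, 0.1 m in z
            # point cloud range, in [xmin, ymin, zmin, xmax, ymax, zmax] order:
            # the x range is [0, 70.4]
            # the y range is [-40, 40]
            # the z range is [-3, 1]
            point_cloud_range=[0, -40., -3., 70.4, 40., 1.],
            max_num_points=5, # at most 5 points are kept per voxel
            max_voxels=20000  # maximum number of voxels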
        ),

As analyzed in my previous post, the voxel generator is set up in the constructor of the class KITTILiDAR.

self.generator = generator

generator is specified as VoxelGenerator in car_cfg.py. Here is the constructor of that class:

class VoxelGenerator:
    def __init__(self,
                 voxel_size,
                 point_cloud_range,
                 max_num_points,
                 max_voxels=20000):
        point_cloud_range = np.array(point_cloud_range, dtype=np.float32)
        # [0, -40, -3, 70.4, 40, 1]
        voxel_size = np.array(voxel_size, dtype=np.float32)
        # worked out by hand: grid_size is 1408*1600*40 (x, y, z)
        grid_size = (
            point_cloud_range[3:] - point_cloud_range[:3]) / voxel_size
        # round grid_size to integers
        grid_size = np.round(grid_size).astype(np.int64)
        self._voxel_size = voxel_size
        self._point_cloud_range = point_cloud_range
        self._max_num_points = max_num_points
        self._max_voxels = max_voxels
        self._grid_size = grid_size

    # generate the voxels
    def generate(self, points):
        return points_to_voxel(
            points, self._voxel_size, self._point_cloud_range,
            self._max_num_points, True, self._max_voxels)

The function points_to_voxel is a bit involved, so I first read its docstring to understand its inputs and outputs:

def points_to_voxel(points, # N*ndim point cloud (xyz plus extra channels such as reflectivity)
                     voxel_size, # size of one voxel: 0.05 m in x and y, 0.1 m in z
                     coors_range, # [0, -40, -3, 70.4, 40, 1]
                     max_points=35, # the SA-SSD config passes max_points = 5
                     reverse_index=True,
                     max_voxels=20000):
    """convert kitti points(N, >=3) to voxels. This version calculate
    everything in one loop. now it takes only 4.2ms(complete point cloud)
    with jit and 3.2ghz cpu.(don't calculate other features)
    Note: this function in ubuntu seems faster than windows 10.

    Args:
        points: [N, ndim] float tensor. points[:, :3] contain xyz points and
            points[:, 3:] contain other information such as reflectivity.
        voxel_size: [3] list/tuple or array, float. xyz, indicate voxel size
        coors_range: [6] list/tuple or array, float. indicate voxel range.
            format: xyzxyz, minmax
        max_points: int. indicate maximum points contained in a voxel.
        reverse_index: boolean. indicate whether return reversed coordinates.
            if points has xyz format and reverse_index is True, output
            coordinates will be zyx format, but points in features always
            xyz format.
        max_voxels: int. indicate maximum voxels this function create.
            for second, 20000 is a good choice. you should shuffle points
            before call this function because max_voxels may drop some points.

    Returns: (see the code below, where I annotate all tensor shapes)
        voxels: [M, max_points, ndim] float tensor. only contain points.
        coordinates: [M, 3] int32 tensor.
        num_points_per_voxel: [M] int32 tensor.
    """
    if not isinstance(voxel_size, np.ndarray):
        voxel_size = np.array(voxel_size, dtype=points.dtype)
    if not isinstance(coors_range, np.ndarray):
        coors_range = np.array(coors_range, dtype=points.dtype)
    # voxelmap_shape is the tuple (1408, 1600, 40)
    voxelmap_shape = (coors_range[3:] - coors_range[:3]) / voxel_size
    voxelmap_shape = tuple(np.round(voxelmap_shape).astype(np.int32).tolist())
    # reversed: voxelmap_shape = (40, 1600, 1408)
    if reverse_index:
        voxelmap_shape = voxelmap_shape[::-1]
    # don't create large array in jit(nopython=True) code.
    # num_points_per_voxel is an array of length 20000
    num_points_per_voxel = np.zeros(shape=(max_voxels, ), dtype=np.int32)
    # coor_to_voxelidx is a 40*1600*1408 array filled with -1
    coor_to_voxelidx = -np.ones(shape=voxelmap_shape, dtype=np.int32)
    # voxels is a 20000*5*ndim array: 5 is the maximum number of points per voxel,
    # ndim is the number of channels per point (xyz plus any extras)
    voxels = np.zeros(
        shape=(max_voxels, max_points, points.shape[-1]), dtype=points.dtype)
    # coors is a 20000*3 array filled with zeros; it will hold the voxel coordinates
    coors = np.zeros(shape=(max_voxels, 3), dtype=np.int32)
    if reverse_index:
        # call the JIT-compiled CPU kernel (see the docstring) to voxelize the point cloud
        # the call fills voxels, coors and num_points_per_voxel in place
        # for convenience, denote voxel_num by V
        voxel_num = _points_to_voxel_reverse_kernel(
            points, voxel_size, coors_range, num_points_per_voxel,
            coor_to_voxelidx, voxels, coors, max_points, max_voxels)

    else:
        voxel_num = _points_to_voxel_kernel(
            points, voxel_size, coors_range, num_points_per_voxel,
            coor_to_voxelidx, voxels, coors, max_points, max_voxels)

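    coors = coors[:voxel_num] # V*3, the coordinates of all non-empty voxels
    voxels = voxels[:voxel_num] # V*5*ndim, the points stored in each voxel
    num_points_per_voxel = num_points_per_voxel[:voxel_num] # array of length V: how many points each voxel actually holds

    # the statement below is commented out in the source; a quick note on it anyway
    # it computes the centroid of the points in each voxel and writes each point's
    # offset from that centroid into the last three channels; the shape of voxels stays V*5*ndim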
    
    # voxels[:, :, -3:] = voxels[:, :, :3] - \
    #     voxels[:, :, :3].sum(axis=1, keepdims=True)/num_points_per_voxel.reshape(-1, 1, 1)
    return voxels, coors, num_points_per_voxel

All in all, voxels and coors now make sense.
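To make the three return values concrete, here is a minimal pure-NumPy sketch of the same grouping idea. It is not the numba kernel from the repository: the function name voxelize and the dictionary-based lookup are my own stand-ins, and it only aims to reproduce the shapes and the (z, y, x) coordinate order described above.

```python
import numpy as np

def voxelize(points, voxel_size, pc_range, max_points=5, max_voxels=20000):
    """Group points into voxels (hypothetical stand-in for points_to_voxel)."""
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    pc_range = np.asarray(pc_range, dtype=np.float32)
    grid_size = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(np.int32)

    voxels = np.zeros((max_voxels, max_points, points.shape[-1]), dtype=points.dtype)
    coors = np.zeros((max_voxels, 3), dtype=np.int32)
    num_points = np.zeros((max_voxels,), dtype=np.int32)
    index_of = {}  # (z, y, x) -> row index in voxels

    for p in points:
        c = np.floor((p[:3] - pc_range[:3]) / voxel_size).astype(np.int32)
        if np.any(c < 0) or np.any(c >= grid_size):
            continue  # the point lies outside the range
        key = tuple(int(v) for v in c[::-1])  # (z, y, x), as with reverse_index=True
        if key not in index_of:
            if len(index_of) >= max_voxels:
                continue  # voxel budget exhausted
            index_of[key] = len(index_of)
            coors[index_of[key]] = key
        v = index_of[key]
        if num_points[v] < max_points:
            voxels[v, num_points[v]] = p
            num_points[v] += 1

    n = len(index_of)
    return voxels[:n], coors[:n], num_points[:n]

# four toy points with (x, y, z, reflectivity); the last one is out of range
pts = np.array([[1.01, 0.01, -0.95, 0.5],
                [1.02, 0.02, -0.95, 0.7],
                [30.02, -10.02, 0.05, 0.1],
                [99.0, 0.0, 0.0, 0.3]], dtype=np.float32)
voxels, coors, num_points = voxelize(pts, [0.05, 0.05, 0.1],
                                     [0, -40., -3., 70.4, 40., 1.])
print(voxels.shape, coors.tolist(), num_points.tolist())
```

The first two points fall into the same voxel, so the sketch returns two voxels: one holding two points and one holding a single point.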

Now back to KITTILiDAR, to see how the voxels are actually computed there. The code is:

        if isinstance(self.generator, VoxelGenerator):
            # surprisingly, this call is commented out: generate() is never used
            #voxels, coordinates, num_points = self.generator.generate(points)
            
            voxel_size = self.generator.voxel_size # voxel size: 0.05 m in x and y, 0.1 m in z
            pc_range = self.generator.point_cloud_range # [0, -40., -3., 70.4, 40., 1.]
            grid_size = self.generator.grid_size # [1408, 1600, 40]

            keep = points_op_cpu.points_bound_kernel(points, pc_range[:3], pc_range[3:])
            voxels = points[keep, :] # keep only the points inside the range: an N*ndim array
            # subtract the range minimum, divide by the voxel size and truncate to int:
            # every point gets its own (z, y, x) voxel coordinate, giving an N*3 array
            coordinates = ((voxels[:, [2, 1, 0]] - np.array(pc_range[[2,1,0]], dtype=np.float32)) / np.array(
                voxel_size[::-1], dtype=np.float32)).astype(np.int32)
            num_points = np.ones(len(keep)).astype(np.int32) # one point per "voxel", so every count is 1

            data['voxels'] = DC(to_tensor(voxels.astype(np.float32)))
            data['coordinates'] = DC(to_tensor(coordinates))
            data['num_points'] = DC(to_tensor(num_points))

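SA-SSD computes its voxels in a rather crude way: it skips the standard points_to_voxel routine and treats every in-range point as its own voxel. I do not know why. Let's leave it at that for now (doge).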
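The per-point computation can be tried on toy data. The sketch below mirrors the snippet above with plain NumPy; the range test that replaces points_op_cpu.points_bound_kernel is my assumption about what that kernel returns (the indices of the in-range points), not its actual implementation.

```python
import numpy as np

pc_range = np.array([0, -40., -3., 70.4, 40., 1.], dtype=np.float32)
voxel_size = np.array([0.05, 0.05, 0.1], dtype=np.float32)

# three toy points with (x, y, z, reflectivity); the last one is out of range
points = np.array([[1.01, 0.01, -0.95, 0.5],
                   [1.02, 0.02, -0.95, 0.7],
                   [99.0, 0.0, 0.0, 0.3]], dtype=np.float32)

# assumed stand-in for points_bound_kernel: indices of points inside the range
inside = np.all((points[:, :3] >= pc_range[:3]) & (points[:, :3] < pc_range[3:]), axis=1)
keep = np.nonzero(inside)[0]

kept = points[keep, :]
# same arithmetic as in KITTILiDAR: (z, y, x) index of every kept point
coordinates = ((kept[:, [2, 1, 0]] - pc_range[[2, 1, 0]]) / voxel_size[::-1]).astype(np.int32)
num_points = np.ones(len(keep), dtype=np.int32)

print(kept.shape, coordinates.tolist(), num_points.tolist())
```

Unlike points_to_voxel, the two nearby points are not merged: they share the same integer coordinate but remain two separate rows.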

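3. Understanding the Role of Anchors in SA-SSD

As a beginner meeting anchors for the first time, I did not really understand what they do, so it is worth digging into the relevant code in SA-SSD.

3.1 Anchor Generation

The hyperparameters for anchor generation are also in car_cfg.py. Because SA-SSD is trained only on the Car class, the anchors are defined for that single class.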

        anchor_generator=dict(
            type='AnchorGeneratorStride', # class used to generate the anchors
            sizes=[1.6, 3.9, 1.56], # size of one anchor: width 1.6 m, length 3.9 m, height 1.56 m
            anchor_strides=[0.4, 0.4, 1.0],
            anchor_offsets=[0.2, -39.8, -1.78],
            rotations=[0, 1.57], # only two anchor orientations: 0 and 1.57 rad (about 90 degrees)
        ),
        anchor_area_threshold=1,
        out_size_factor=8,

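As analyzed in my previous post, the anchors are generated in the constructor of KITTILiDAR, where anchor_generator is specified as AnchorGeneratorStride in car_cfg.py. In the expression [*feature_map_size, 1], the * unpacks the elements of feature_map_size into the new list. [::-1] reverses the order of the elements, and [:2] takes the elements at indices 0 and 1.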

        # anchor
        if anchor_generator is not None:
            # from Section 2, grid_size is [1408, 1600, 40]
            # feature_map_size is the BEV (xy) grid after downsampling by
            # out_size_factor = 8: [1408, 1600] // 8 = [176, 200]
            feature_map_size = self.generator.grid_size[:2] // self.out_size_factor
            # [176, 200] => [176, 200, 1] => [1, 200, 176]
            feature_map_size = [*feature_map_size, 1][::-1]
            # feed [1, 200, 176] to generate the anchors
            # the result is a (1, 200, 176, 1, 2, 7) array:
            # 2 is the number of rotations (0 and 90 degrees), 7 is the anchor parameters xyzwlh and yaw
            anchors = anchor_generator(feature_map_size)
            # one row of 7 parameters (xyzwlh and yaw) per anchor
            # self.anchors is a (200*176*2, 7) array
            self.anchors = anchors.reshape([-1, 7])
            # build the BEV anchors anchors_bv from columns [0, 1, 3, 4, 6],
            # i.e. x, y, w, l and the rotation angle
            # rbbox2d_to_near_bbox outputs [N, 4(xmin, ymin, xmax, ymax)] bboxes
            # self.anchors_bv is a (200*176*2, 4) array
            self.anchors_bv = rbbox2d_to_near_bbox(
                self.anchors[:, [0, 1, 3, 4, 6]])
        else:
            self.anchors=None
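The size arithmetic in the constructor is easy to check by hand. The following sketch only replays the two lines that compute feature_map_size, using the grid size from Section 2 and out_size_factor = 8 from car_cfg.py; the variable num_anchors is my own addition.

```python
import numpy as np

grid_size = np.array([1408, 1600, 40])  # (x, y, z), from Section 2
out_size_factor = 8                      # from car_cfg.py

# integer division downsamples the BEV grid: [176, 200]
feature_map_size = grid_size[:2] // out_size_factor
# unpack into a new list, append 1 for z, then reverse to (z, y, x)
feature_map_size = [*feature_map_size, 1][::-1]
feature_map_size = [int(v) for v in feature_map_size]

# two rotations per location
num_anchors = feature_map_size[1] * feature_map_size[2] * 2
print(feature_map_size, num_anchors)
```

So the anchors live on a 200 by 176 BEV grid, with 70400 anchors in total, rather than on the full 1600 by 1408 voxel grid.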

The core of the code above is anchor_generator(feature_map_size). Let's see how AnchorGeneratorStride generates the anchors:

    def __call__(self, feature_map_size):
        return create_anchors_3d_stride(
            feature_map_size, self._sizes, self._anchor_strides,
            self._anchor_offsets, self._rotations, self._dtype)

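It calls create_anchors_3d_stride. (I did not follow the middle part of this function at first; tracking the shapes of its inputs and outputs is enough.) np.meshgrid builds coordinate grids from 1-D coordinate arrays.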

def create_anchors_3d_stride(feature_size, # [1, 200, 176] in SA-SSD
                             sizes=[1.6, 3.9, 1.56], # size of a single anchor
                             anchor_strides=[0.4, 0.4, 0.0], # spacing between anchors; the cfg uses [0.4, 0.4, 1.0]
                             anchor_offsets=[0.2, -39.8, -1.78],
                             rotations=[0, np.pi / 2],
                             dtype=np.float32):
    """
    Args:
        feature_size: list [D, H, W](zyx)
        sizes: [N, 3] list of list or array, size of anchors, xyz

    Returns:
        anchors: [*feature_size, num_sizes, num_rots, 7] tensor.
    """
    # almost 2x faster than v1
    x_stride, y_stride, z_stride = anchor_strides # 0.4, 0.4, 1.0
    x_offset, y_offset, z_offset = anchor_offsets # 0.2, -39.8, -1.78
    z_centers = np.arange(feature_size[0], dtype=dtype) # the array [0]
    y_centers = np.arange(feature_size[1], dtype=dtype) # 0, 1, ..., 200-1
    x_centers = np.arange(feature_size[2], dtype=dtype) # 0, 1, ..., 176-1
    
    # with feature_size = [1, 200, 176] the centers stay inside the point cloud range
    z_centers = z_centers * z_stride + z_offset # -1.78
    y_centers = y_centers * y_stride + y_offset # -39.8, -39.4, ..., 39.8
    x_centers = x_centers * x_stride + x_offset # 0.2, 0.6, ..., 70.2
    sizes = np.reshape(np.array(sizes, dtype=dtype), [-1, 3]) # a 1*3 array; N anchor sizes would give an N*3 array
    rotations = np.array(rotations, dtype=dtype)
    # build the coordinate grids
    rets = np.meshgrid(
        x_centers, y_centers, z_centers, rotations, indexing='ij')
    tile_shape = [1] * 5 # same as [1,1,1,1,1]
    tile_shape[-2] = int(sizes.shape[0]) # with N anchor sizes this becomes [1,1,1,N,1]
    # the loop runs 4 times, once per grid (x, y, z, rotation); it inserts the size axis and a trailing axis for concatenation
    for i in range(len(rets)):
        rets[i] = np.tile(rets[i][..., np.newaxis, :], tile_shape)
        rets[i] = rets[i][..., np.newaxis]  # for concat
    sizes = np.reshape(sizes, [1, 1, 1, -1, 1, 3])
    tile_size_shape = list(rets[0].shape)
    tile_size_shape[3] = 1
    sizes = np.tile(sizes, tile_size_shape)
    rets.insert(3, sizes)
    ret = np.concatenate(rets, axis=-1)
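    # the output is a (1, 200, 176, 1, 2, 7) array
    # axis 0: anchor index along z (a single layer)
    # axis 1: anchor index along y, 0 ~ 200-1
    # axis 2: anchor index along x, 0 ~ 176-1
    # axis 3: anchor size (class); only car is generated, so there is one
    # axis 4: anchor rotation; only 0 and 90 degrees are generated
    # axis 5: the 7 anchor parameters: xyz, wlh, then the yaw angle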
    return np.transpose(ret, [2, 1, 0, 3, 4, 5])

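The code above is a little hard to follow. Fortunately it has no dependencies beyond NumPy, so you can copy it out, run it on its own, and print whatever variables puzzle you. Below is my debugging code (it can also generate anchors for several classes):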

def main():
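    # note: [1, 1600, 1408] is the full-resolution BEV grid; SA-SSD itself passes
    # the downsampled feature map [1, 200, 176]
    feature_size = [1, 1600, 1408]
    # anchors for one class
    # the result has shape (1, 1600, 1408, 1, 2, 7)
    res = create_anchors_3d_stride(feature_size, anchor_strides=[0.4, 0.4, 1.0])
    # anchors for two classes; both classes share the same anchor_strides
    # the result has shape (1, 1600, 1408, 2, 2, 7)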
    # res = create_anchors_3d_stride(feature_size, sizes=[[1.6, 3.9, 1.56],[1.0, 3.0, 2.56]], anchor_strides=[0.4, 0.4, 1.0])
    print("ss: ", res[0][0][0][0][0][:])
    print("ss: ", res[0][0][1000][0][0][:])

if __name__ == "__main__":
    main()

Each output is a 7-dimensional vector whose meaning is explained in the code:

ss:  [  0.2  -39.8   -1.78   1.6    3.9    1.56   0.  ]
ss:  [400.2  -39.8   -1.78   1.6    3.9    1.56   0.  ]

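The anchor at index $(1000, 0)$ has an x coordinate of 400.2 m, far beyond the 70.4 m range of the point cloud. This is an artifact of my debugging call, which passed the full-resolution grid [1, 1600, 1408]. In SA-SSD the grid is first divided by out_size_factor = 8, so feature_size is [1, 200, 176]; the x centers then run from 0.2 m to 70.2 m and the y centers from -39.8 m to 39.8 m, which matches the point cloud range. With that, the mystery of anchor generation is solved (doge).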
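The center ranges for the downsampled feature map can be verified with a few lines. This sketch repeats only the center arithmetic of create_anchors_3d_stride, in float64 for readability; the feature size [1, 200, 176] follows from grid_size[:2] // out_size_factor as computed in the constructor.

```python
import numpy as np

feature_size = [1, 200, 176]     # (z, y, x)
strides = [0.4, 0.4, 1.0]        # x, y, z strides from car_cfg.py
offsets = [0.2, -39.8, -1.78]    # x, y, z offsets from car_cfg.py

x_centers = np.arange(feature_size[2]) * strides[0] + offsets[0]
y_centers = np.arange(feature_size[1]) * strides[1] + offsets[1]
z_centers = np.arange(feature_size[0]) * strides[2] + offsets[2]

print("x:", x_centers[0], "...", round(x_centers[-1], 2))
print("y:", y_centers[0], "...", round(y_centers[-1], 2))
print("z:", z_centers)
```

Each anchor center is 0.4 m from its neighbours, which equals 8 voxels of 0.05 m, so one anchor location corresponds to one cell of the 8x downsampled BEV feature map.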

3.2 Anchor Mask

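KITTILiDAR generates the Anchor Mask along with the anchors. LiDAR point clouds are sparse, yet the anchors cover the whole BEV area. A 3D object can only be where there are points, so anchors over empty regions are of no use. The Anchor Mask marks the anchors that cover points. Here is the code that generates it. np.cumsum computes a cumulative sum along an axis; here it acts as a discrete integral along that axis.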

# in the cfg file, self.anchor_area_threshold = 1
if self.anchor_area_threshold >= 0 and self.anchors is not None:
    # coordinates is an N*3 array in (z, y, x) order
    # grid_size is [1408, 1600, 40]
    # tuple(grid_size[::-1][1:]) is the tuple (1600, 1408)
    # dense_voxel_map is a 1600*1408 matrix;
    # dense_voxel_map[i][j] = a means that BEV cell (i, j) holds a voxels
    # dense_voxel_map can be seen as the density of voxels over the BEV grid
    dense_voxel_map = sparse_sum_for_anchors_mask(
        coordinates, tuple(grid_size[::-1][1:]))
    # cumulative sum along axis 0
    dense_voxel_map = dense_voxel_map.cumsum(0)
    # then along axis 1; dense_voxel_map is still a 1600*1408 matrix
    dense_voxel_map = dense_voxel_map.cumsum(1)
    # a cumulative sum is a discrete integral, so the two passes integrate over y and x
    # dense_voxel_map is now an integral image (summed-area table) of the voxel counts
    
    # self.anchors_bv holds the BEV anchors, a (200*176*2, 4) array
    # voxel_size is [0.05, 0.05, 0.1]
    # pc_range is [0, -40., -3., 70.4, 40., 1.]
    # grid_size is [1408, 1600, 40]
    # anchors_area is a vector of length 200*176*2
    anchors_area = fused_get_anchors_area(
        dense_voxel_map, self.anchors_bv, voxel_size, pc_range, grid_size)
    # the comparison is strict, so with anchor_area_threshold = 1 an anchor is kept
    # only when its BEV box covers more than one voxel
    # anchors_mask is a bool vector of length 200*176*2
    anchors_mask = anchors_area > self.anchor_area_threshold
    data['anchors_mask'] =  DC(to_tensor(anchors_mask.astype(np.uint8)))

The function sparse_sum_for_anchors_mask is shown below:

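# numba is a JIT compiler for Python functions that work on arrays and numbers;
# it can greatly speed up such functions compared with plain Python.
# shape is the tuple (1600, 1408)
@numba.jit(nopython=True)
def sparse_sum_for_anchors_mask(coors, shape):
    # ret is a 1600*1408 grid; each cell counts the voxels that fall into it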
    ret = np.zeros(shape, dtype=np.float32)
    for i in range(coors.shape[0]):
        ret[coors[i, 1], coors[i, 2]] += 1

    return ret

The function fused_get_anchors_area is shown below:

# dense_map: the 1600*1408 integral image of voxel counts
# anchors_bv: the BEV anchors, a (200*176*2, 4) array
# stride: the voxel size [0.05, 0.05, 0.1]
# offset: pc_range, i.e. [0, -40., -3., 70.4, 40., 1.]
# grid_size: [1408, 1600, 40]
@numba.jit(nopython=True)
def fused_get_anchors_area(dense_map, anchors_bv, stride, offset,
                           grid_size):
    # anchors_bv.shape[1:] is (4,), so this is a zero vector of length 4
    anchor_coor = np.zeros(anchors_bv.shape[1:], dtype=np.int32)
    grid_size_x = grid_size[0] - 1 # 1408-1
    grid_size_y = grid_size[1] - 1 # 1600-1
    N = anchors_bv.shape[0] # 200*176*2
    ret = np.zeros((N), dtype=dense_map.dtype) # zero vector of length 200*176*2
    for i in range(N):
        # convert metric coordinates to voxel indices
        # anchors_bv[i, :4] is a 2D box (xmin, ymin, xmax, ymax)
        anchor_coor[0] = np.floor(
            (anchors_bv[i, 0] - offset[0]) / stride[0])
        anchor_coor[1] = np.floor(
            (anchors_bv[i, 1] - offset[1]) / stride[1])
        anchor_coor[2] = np.floor(
            (anchors_bv[i, 2] - offset[0]) / stride[0])
        anchor_coor[3] = np.floor(
            (anchors_bv[i, 3] - offset[1]) / stride[1])
        # clamp to the voxel grid
        anchor_coor[0] = max(anchor_coor[0], 0)
        anchor_coor[1] = max(anchor_coor[1], 0)
        anchor_coor[2] = min(anchor_coor[2], grid_size_x)
        anchor_coor[3] = min(anchor_coor[3], grid_size_y)
        ID = dense_map[anchor_coor[3], anchor_coor[2]] # xmax, ymax
        IA = dense_map[anchor_coor[1], anchor_coor[0]] # xmin, ymin
        IB = dense_map[anchor_coor[3], anchor_coor[0]]
        IC = dense_map[anchor_coor[1], anchor_coor[2]]
        # read the number of voxels inside the box off the integral image
        # ret[i] is the number of voxels with x_min<x<=x_max, y_min<y<=y_max (in voxel indices)
        # F(x_min<x<=x_max, y_min<y<=y_max) = 
        # F(x_max,y_max) - F(x_max,y_min) - F(x_min,y_max) + F(x_min,y_min)
        ret[i] = ID - IB - IC + IA
    return ret
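The two cumulative sums followed by the four-corner lookup form an integral image (summed-area table). The toy example below, with a made-up 4 by 4 count map and a helper name box_count of my own, applies the same formula as fused_get_anchors_area and compares it with a direct sum.

```python
import numpy as np

# toy voxel counts per BEV cell; rows are y, columns are x
counts = np.array([[1, 0, 2, 0],
                   [0, 3, 0, 1],
                   [4, 0, 0, 0],
                   [0, 1, 0, 2]], dtype=np.float32)

# same as dense_voxel_map.cumsum(0).cumsum(1)
integral = counts.cumsum(0).cumsum(1)

def box_count(integral, xmin, ymin, xmax, ymax):
    """Four-corner lookup, written exactly as in fused_get_anchors_area."""
    ID = integral[ymax, xmax]
    IA = integral[ymin, xmin]
    IB = integral[ymax, xmin]
    IC = integral[ymin, xmax]
    return ID - IB - IC + IA

n = box_count(integral, xmin=0, ymin=0, xmax=2, ymax=2)
direct = counts[1:3, 1:3].sum()  # rows ymin+1..ymax, columns xmin+1..xmax
in_mask = n > 1                  # anchor_area_threshold = 1
print(n, direct, in_mask)
```

Because the cumulative sums are inclusive, the lookup counts the cells strictly after the min index and up to the max index. Each anchor costs four array reads regardless of its size, which is why the table is built once per frame.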

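3.3 Using the Anchors

Anchors are used mainly in the detection heads; the backbone network that comes before does not use them.

I first look at how anchors are used in the first head, SSDRotateHead, where they enter through the function get_guided_anchors. Its place in the forward graph is shown in Figure 1 of my previous post. The BEV feature map output by the neck is fed into SSDRotateHead, which predicts 3D boxes and class scores for every anchor. These initial predictions are passed to get_guided_anchors, which decodes them against the anchors, keeps only the positions selected by the Anchor Mask whose class score exceeds a threshold, and returns the resulting boxes as the "guided anchors".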

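# a note on what rpn_outs contains
# rpn_outs  = [box_preds, cls_preds, dir_cls_preds]
# dir_cls_preds is a two-way direction classification; it is used to flip the yaw angle by pi when needed
# let N be the batch size
# box_preds is an [N, y(H), x(W), C] tensor; get_guided_anchors reshapes it to one 7-vector per anchor,
# so C packs the 7 box parameters of each of the 2 anchors at a location
# cls_preds is an [N, y(H), x(W), C] tensor; C packs the class scores of the anchors at a location (one class when only cars are detected)
# dir_cls_preds is an [N, y(H), x(W), C] tensor; C packs the 2 direction scores of each anchor at a location
# y(H), x(W) index the y and x axes in the BEV view
# the x range is 0~70.4 m and the y range is -40.0~40.0 m (if you still remember)
# a reminder: H and W are not image-plane dimensions; they are the size of the BEV feature map, H=200, W=176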
guided_anchors = self.rpn_head.get_guided_anchors(*rpn_outs, ret['anchors'], ret['anchors_mask'], ret['gt_bboxes'], thr=0.1)

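Note: why does the neck output a BEV feature map? Doesn't SA-SSD take a voxelized point cloud, apply sparse convolutions, and end up with voxel features? The paper has an extra Reshape step that turns the voxel features into BEV features. Section 4 covers this detail.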

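That is the rough idea. Without further ado, here is the code (I did not follow all of it; my understanding is written in the comments). for ... in zip(...) iterates over several sequences in parallel.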

    # anchors_mask is a bool tensor with 200*176*2 entries per sample
    # anchors is a (200*176*2, 7) tensor
    # box_preds, cls_preds, dir_cls_preds are [N, H, W, C] tensors
    # C differs per tensor: per anchor it is 7, num_class and 2
    # N is the batch size
    def get_guided_anchors(self, box_preds, cls_preds, dir_cls_preds, anchors, anchors_mask, gt_bboxes, thr=.1):
        batch_size = box_preds.shape[0]

        # batch_box_preds is an [N, H*W*2, 7] tensor: one row per anchor
        batch_box_preds = box_preds.view(batch_size, -1, self._box_code_size)
        # batch_anchors_mask is an [N, 200*176*2] tensor
        batch_anchors_mask = anchors_mask.view(batch_size, -1)
        # batch_cls_preds is an [N, H*W*2] tensor; written this way it only handles a single class
        # for several classes it should be [N, H*W*2, num_class]
        batch_cls_preds = cls_preds.view(batch_size, -1)
        # second_box_decode turns the predicted residuals into boxes using the anchors (I do not fully follow its code)
        batch_box_preds = second_box_decode(batch_box_preds, anchors)

        if self._use_direction_classifier:
            batch_dir_preds = dir_cls_preds.view(batch_size, -1, 2)

        new_boxes = []
        if gt_bboxes is None:
            gt_bboxes = [None] * batch_size

        # zip iterates over the batch, so the loop runs N (batch_size) times
        for box_preds, cls_preds, dir_preds, a_mask, gt_boxes in zip(
                batch_box_preds, batch_cls_preds, batch_dir_preds, batch_anchors_mask, gt_bboxes
        ):
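            # judging by the function name, this block selects the guided anchors
            # I do not follow every line, but the idea is:
            # first, keep only the decoded boxes at positions selected by the Anchor Mask;
            # next, pass the cls_preds of those boxes through a sigmoid
            #      and keep the boxes whose score exceeds the threshold thr;
            # then, add pi to the yaw where the direction classifier disagrees with its sign;
            # finally, in training, where ground-truth 3D boxes are available,
            # prepend the ground-truth boxes to the kept boxes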
            box_preds = box_preds[a_mask]
            cls_preds = cls_preds[a_mask]
            dir_preds = dir_preds[a_mask]

            if self._use_direction_classifier:
                dir_labels = torch.max(dir_preds, dim=-1)[1]

            if self._use_sigmoid_cls:
                total_scores = torch.sigmoid(cls_preds)
            else:
                total_scores = F.softmax(cls_preds, dim=-1)[..., 1:]

            top_scores = torch.squeeze(total_scores, -1)

            selected = top_scores > thr

            box_preds = box_preds[selected]

            if self._use_direction_classifier:
                dir_labels = dir_labels[selected]
                opp_labels = (box_preds[..., -1] > 0) ^ dir_labels.byte()
                box_preds[opp_labels, -1] += np.pi

            # add ground-truth
            if gt_boxes is not None:
                box_preds = torch.cat([gt_boxes, box_preds],0)

            # store the boxes kept for this sample
            new_boxes.append(box_preds)
        return new_boxes
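The selection logic for one sample can be replayed with NumPy on made-up numbers. In the sketch below, select_guided and sigmoid are hypothetical helper names, the boxes are assumed to be already decoded, and the direction-classifier step is left out for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_guided(boxes, cls_logits, anchors_mask, thr=0.1, gt_boxes=None):
    """Mask, score-threshold and (in training) prepend ground truth."""
    boxes = boxes[anchors_mask]                    # keep anchors that cover points
    scores = sigmoid(cls_logits[anchors_mask])     # single-class scores
    kept = boxes[scores > thr]                     # keep confident predictions
    if gt_boxes is not None:
        kept = np.concatenate([gt_boxes, kept], axis=0)
    return kept

# four decoded boxes with 7 parameters each (toy values)
boxes = np.arange(4 * 7, dtype=np.float32).reshape(4, 7)
cls_logits = np.array([3.0, -5.0, 2.0, 4.0], dtype=np.float32)
anchors_mask = np.array([True, True, False, True])

guided = select_guided(boxes, cls_logits, anchors_mask)
guided_train = select_guided(boxes, cls_logits, anchors_mask,
                             gt_boxes=np.zeros((1, 7), dtype=np.float32))
print(guided.shape, guided_train.shape)
```

Box 1 is dropped for its low score and box 2 is dropped by the mask, so boxes 0 and 3 survive; in training mode the ground-truth box is placed in front of them.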

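3.4 What Anchors Are For

I did not come from deep learning, let alone object detection. When I first read the code I could not make sense of the anchors, so I simply skipped everything related to them. The overall framework of deep learning is not hard to grasp, and the code for this paper is fairly clean, so the earlier posts went smoothly. As I read deeper, my understanding of anchors kept improving, and it was only while writing this section that it finally clicked.

Back to the topic. Here is my plain-language understanding of anchors. 3D object detection predicts 7 parameters per object plus its class (8 values in total). If I only detect cars, I need to regress 7 parameters per object: xyzwlh and the yaw angle. Regressing them from scratch is hard, and a network doing so tends to produce many imprecise boxes. But cars have a lot in common: (small) cars of all makes have roughly the same length, width and height, and they all drive on the ground, so their centers sit at roughly the same height. An anchor is a prior for a 3D object: a place where an object may appear with a given size and orientation. If I only detect cars, I can generate a set of anchors with fixed wlh and z and spread them evenly over the BEV view. Every object then lies near some anchor, and the network only has to predict a score and a small residual relative to each anchor. Decoding the residuals of the high-scoring anchors gives a set of reasonably reliable boxes (the guided anchors in SA-SSD) for later processing.

Sometimes there are too many anchors. Since objects can only be where there are points, we can discard the anchors that cover no points (this is the job of the Anchor Mask) and keep only the high-scoring predictions among the remaining anchors as guided anchors.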
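To illustrate what "a residual relative to an anchor" means, here is a simplified sketch of a residual decoding in the style commonly used by SECOND-like detectors. It is an assumption for illustration only: the function name decode_boxes is mine, and the actual second_box_decode in the SA-SSD code may differ in details such as the z reference or the angle encoding.

```python
import numpy as np

def decode_boxes(residuals, anchors):
    """Turn per-anchor residuals into boxes.

    Both arrays are [M, 7] in (x, y, z, w, l, h, yaw) order (assumed layout).
    """
    xa, ya, za, wa, la, ha, ra = np.split(anchors, 7, axis=-1)
    xt, yt, zt, wt, lt, ht, rt = np.split(residuals, 7, axis=-1)
    diagonal = np.sqrt(la ** 2 + wa ** 2)  # BEV diagonal of the anchor
    xg = xt * diagonal + xa                # position residuals are scaled
    yg = yt * diagonal + ya
    zg = zt * ha + za
    wg = np.exp(wt) * wa                   # size residuals are log-ratios
    lg = np.exp(lt) * la
    hg = np.exp(ht) * ha
    rg = rt + ra                           # yaw residual is additive
    return np.concatenate([xg, yg, zg, wg, lg, hg, rg], axis=-1)

anchor = np.array([[0.2, -39.8, -1.78, 1.6, 3.9, 1.56, 0.0]])
same = decode_boxes(np.zeros((1, 7)), anchor)
shifted = decode_boxes(np.array([[0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2]]), anchor)
print(same)
print(shifted)
```

A zero residual returns the anchor itself, so the anchor acts as the default guess and the network only learns small corrections around it.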

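As the computation graphs in Figures 1 and 3 of the previous post show, SA-SSD does not use rpn_outs (the raw network output) directly as its final result; it uses the guided anchors for the final refinement of the 3D detections. That later processing is the job of Extra_Head, which does not fit in this post and is left for the next one.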

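4. Converting Sparse 3D Features to BEV Features

Before the feature conversion, a few words on sparse convolution.

SA-SSD processes the point cloud with sparse convolutions. Sparse convolution is not hard to understand, so I only discuss a few details (see my earlier post on sparse convolution). As an extension of ordinary convolution, it takes voxels as input: its receptive field is a discrete block of cells, so the input points must be discretized into integer coordinates according to the voxel size. The voxelization is described in Section 2. The voxelized point cloud lives on a $D\times H\times W$ grid, which is the $40\times 1600\times 1408$ grid (z, y, x) from Section 2. A sparse convolution with stride 1 and kernel size $3$, with padding that preserves the spatial size, outputs an $N\times C\times D\times H\times W$ tensor, where $C$ is the number of output channels and $N$ is the batch size. With stride 2, $D, H, W$ all shrink, just as in ordinary convolution.

One more detail: the first part of SA-SSD's auxiliary network needs point-wise features as input. The voxel features must therefore be turned back into ordinary point features, that is, voxel coordinates must be converted to LiDAR coordinates. This is what tensor2points does (let's call it de-voxelization). For features produced after a stride-2 sparse convolution, the voxel size used in de-voxelization must be doubled, as the code shows.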

        x = self.conv0(x)
        x = self.down0(x)  # sp
        x = self.conv1(x)  # 2x sub
        
        if not is_test:
            # de-voxelize; after the downsampling in down0 the voxel size is doubled
            vx_feat, vx_nxyz = tensor2points(x, voxel_size=(.1, .1, .2))
            p1 = nearest_neighbor_interpolate(points_mean, vx_nxyz, vx_feat)

        x = self.down1(x)
        x = self.conv2(x)

        if not is_test:
            # de-voxelize; after the downsampling in down1 the voxel size is doubled again
            vx_feat, vx_nxyz = tensor2points(x, voxel_size=(.2, .2, .4))
            p2 = nearest_neighbor_interpolate(points_mean, vx_nxyz, vx_feat)

        x = self.down2(x)
        x = self.conv3(x)

        if not is_test:
            # de-voxelize; after the downsampling in down2 the voxel size is doubled again
            vx_feat, vx_nxyz = tensor2points(x, voxel_size=(.4, .4, .8))
            p3 = nearest_neighbor_interpolate(points_mean, vx_nxyz, vx_feat)

        out = self.extra_conv(x)
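The de-voxelization step can be sketched in NumPy. The function below is a hypothetical stand-in for tensor2points, not its actual code: it assumes the indices are stored as (batch, z, y, x), that the range minimum is (0, -40, -3), and that each voxel is mapped to its center.

```python
import numpy as np

def voxel_indices_to_points(indices, voxel_size, offset=(0., -40., -3.)):
    """Map [M, 4] (batch, z, y, x) voxel indices to (batch, x, y, z) in metres."""
    indices = np.asarray(indices, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)  # (x, y, z)
    offset = np.asarray(offset, dtype=np.float64)          # range minimum (x, y, z)
    out = indices.copy()
    # reorder zyx -> xyz, scale by the voxel size, shift to the voxel center (assumed)
    out[:, 1:] = indices[:, [3, 2, 1]] * voxel_size + offset + 0.5 * voxel_size
    return out

# the same region of space before and after one stride-2 downsampling
fine = voxel_indices_to_points([[0, 20, 800, 20]], voxel_size=(.05, .05, .1))
coarse = voxel_indices_to_points([[0, 10, 400, 10]], voxel_size=(.1, .1, .2))
print(fine)
print(coarse)
```

Halving the index and doubling the voxel size lands on nearly the same metric location, which is why the three calls in the backbone use (.1, .1, .2), (.2, .2, .4) and (.4, .4, .8).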

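Finally, the feature conversion itself: turning the 3D convolution features into BEV features, which corresponds to the Reshape in the SA-SSD diagram. The code is: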

    def forward(self, voxel_features, coors, batch_size, is_test=False):

        points_mean = torch.zeros_like(voxel_features)
        points_mean[:, 0] = coors[:, 0]
        points_mean[:, 1:] = voxel_features[:, :3]

        coors = coors.int()
        x = spconv.SparseConvTensor(voxel_features, coors, self.sparse_shape, batch_size)
        x, point_misc = self.backbone(x, points_mean, is_test)

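        # turn the 3D convolution features into BEV features
        x = x.dense()
        N, C, D, H, W = x.shape # N, C, D, H, W were discussed above
        # simply merge the C and D axes
        # C*D is the number of channels of the BEV feature map
        # the BEV features are like image features; H and W are the size of the feature map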
        x = x.view(N, C * D, H, W) 

        x = self.fcn(x)

        if is_test:
            return x

        return x, point_misc
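The effect of merging C and D can be checked on a small array. The sketch below uses NumPy's reshape as a stand-in for the view call on a contiguous tensor; the toy sizes are arbitrary.

```python
import numpy as np

N, C, D, H, W = 2, 3, 2, 4, 5
x = np.arange(N * C * D * H * W, dtype=np.float32).reshape(N, C, D, H, W)

# merge the channel axis and the height (z) axis into BEV channels
bev = x.reshape(N, C * D, H, W)

# BEV channel c * D + d holds channel c of the z-slice d
c, d = 1, 1
same_slice = np.array_equal(bev[:, c * D + d], x[:, c, d])
print(bev.shape, same_slice)
```

No values are mixed or dropped: the z-slices are simply stacked as extra channels, so the following 2D convolutions can still see the height information.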