Lift-Splat-Shoot Source Code Notes

This article walks through the implementation of the Lift-Splat-Shoot algorithm, covering data preprocessing, model structure, feature extraction, and voxel encoding, as well as how to evaluate the model with the eval_model_iou function. In the model, EfficientNet serves as the backbone to extract image features, which are then converted into a BEV representation for prediction. The article also records problems that can show up during training, such as the IoU staying at 0, and possible fixes.

First, links 🔗 to other people's write-ups:

https://zhuanlan.zhihu.com/p/567880155

https://blog.csdn.net/weixin_41803339/article/details/127140039?spm=1001.2014.3001.5502

And a video walkthrough:

https://www.bilibili.com/video/av470798830/?vd_source=ff498e5dc05e7bbe6be82c1d9e17f9fa


1. Debugging with the eval_model_iou function in explore.py; a note on how the arguments are passed:

python main.py eval_model_iou mini/trainval --modelf=MODEL_LOCATION --dataroot=NUSCENES_ROOT

For version, choose mini or trainval, and supply the path to the downloaded pretrained model plus the path to the nuScenes root folder.

The version is resolved relative to dataroot; the expected folder layout is:

nuscenes
|---mini
| |---maps
| |---samples
| |---sweeps
| |---v1.0-mini
|---trainval
| |---maps
| |---samples
| |---sweeps
| |---v1.0-trainval

2. Walking through the algorithm:

Debug using:

eval_model_iou(version="mini", modelf="./model525000.pt", dataroot="../data/nuScenes", gpuid=0)


def eval_model_iou(version,
                modelf,
                dataroot='/data/nuscenes',
                gpuid=1,

                H=900, W=1600,
                resize_lim=(0.193, 0.225),
                final_dim=(128, 352),
                bot_pct_lim=(0.0, 0.22),
                rot_lim=(-5.4, 5.4),
                rand_flip=True,

                xbound=[-50.0, 50.0, 0.5],
                ybound=[-50.0, 50.0, 0.5],
                zbound=[-10.0, 10.0, 20.0],
                dbound=[4.0, 45.0, 1.0],  # this determines D = 41 below

                bsz=4,
                nworkers=10,
                ):
    grid_conf = {
        'xbound': xbound,
        'ybound': ybound,
        'zbound': zbound,
        'dbound': dbound,
    }
    data_aug_conf = {
                    'resize_lim': resize_lim,
                    'final_dim': final_dim,
                    'rot_lim': rot_lim,
                    'H': H, 'W': W,
                    'rand_flip': rand_flip,
                    'bot_pct_lim': bot_pct_lim,
                    'cams': ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
                             'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'],
                    'Ncams': 5,  # open question: why 5? (see note 3 at the end)
                }
    trainloader, valloader = compile_data(version, dataroot, data_aug_conf=data_aug_conf,
                                          grid_conf=grid_conf, bsz=bsz, nworkers=nworkers,
                                          parser_name='segmentationdata')

    device = torch.device('cpu') if gpuid < 0 else torch.device(f'cuda:{gpuid}')

    model = compile_model(grid_conf, data_aug_conf, outC=1)
    print('loading', modelf)
    model.load_state_dict(torch.load(modelf))
    model.to(device)

    loss_fn = SimpleLoss(1.0).cuda(gpuid)

    model.eval()
    val_info = get_val_info(model, valloader, loss_fn, device)
    print(val_info)

Before looking at the key methods, first a look at the model structure:

class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        self.grid_conf = grid_conf
        self.data_aug_conf = data_aug_conf

        dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
                                              self.grid_conf['ybound'],
                                              self.grid_conf['zbound'],
                                              )
        self.dx = nn.Parameter(dx, requires_grad=False)
        self.bx = nn.Parameter(bx, requires_grad=False)
        self.nx = nn.Parameter(nx, requires_grad=False)

        self.downsample = 16
        self.camC = 64
        self.frustum = self.create_frustum()
        self.D, _, _, _ = self.frustum.shape  # this D is the 41 used below
        self.camencode = CamEncode(self.D, self.camC, self.downsample)
        self.bevencode = BevEncode(inC=self.camC, outC=outC)

        # toggle using QuickCumsum vs. autograd
        self.use_quickcumsum = True

2.1. The frustum-creation function

First, the function that creates the frustum:

    def create_frustum(self):
        # make grid in image plane
        # ['final_dim'] = (128, 352)
        ogfH, ogfW = self.data_aug_conf['final_dim']

        # self.downsample = 16, fH:8, fW:22
        fH, fW = ogfH // self.downsample, ogfW // self.downsample
        ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)  # output (41, 8, 22)

        # D:41
        D, _, _ = ds.shape
        xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)  # output (41, 8, 22)
        ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)  # output (41, 8, 22)

        # D x H x W x 3
        frustum = torch.stack((xs, ys, ds), -1)  # output (41, 8, 22, 3)
        return nn.Parameter(frustum, requires_grad=False)

torch.linspace() returns a one-dimensional tensor.

torch.stack() stacks xs, ys, ds along a new last dimension: three (41, 8, 22) tensors --> (41, 8, 22, 3).

The last dimension holds (x, y, d), i.e. the pixel coordinates and candidate depth of each (D, H, W) cell, expressed relative to the camera.
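As a quick sanity check (my own minimal standalone reproduction using the default config, not code from the repo), printing the frustum endpoints shows that the last dimension really is (pixel x, pixel y, depth in metres):

import torch

# minimal reproduction of create_frustum() with the default config
dbound = [4.0, 45.0, 1.0]
ogfH, ogfW, downsample = 128, 352, 16
fH, fW = ogfH // downsample, ogfW // downsample   # 8, 22

ds = torch.arange(*dbound, dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
D = ds.shape[0]                                   # 41
xs = torch.linspace(0, ogfW - 1, fW).view(1, 1, fW).expand(D, fH, fW)
ys = torch.linspace(0, ogfH - 1, fH).view(1, fH, 1).expand(D, fH, fW)
frustum = torch.stack((xs, ys, ds), -1)

print(frustum.shape)        # torch.Size([41, 8, 22, 3])
print(frustum[0, 0, 0])     # tensor([0., 0., 4.])     -> pixel (0, 0) at depth 4 m
print(frustum[-1, -1, -1])  # tensor([351., 127., 44.]) -> pixel (351, 127) at depth 44 m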

2.2 Getting voxels centered on the ego vehicle

get_voxels():

Produces the final BEV voxel output, which is then fed into bevencode (bevencode feels more like a decoder here, since its output has outC=1).

get_voxels() calls three important functions:

    def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):

        # all of these arguments come straight from the data loader
        geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)  # output (4, 6, 41, 8, 22, 3)
        x = self.get_cam_feats(x)  # output (4, 6, 41, 8, 22, 64)

        x = self.voxel_pooling(geom, x)

        return x

 get_geometry():

Computes the coordinates corresponding to the lifted "feature point cloud" produced by get_cam_feats (most write-ups describe it this way, though I found it a bit confusing at first).

It also has to undo some of the data-augmentation operations; something to study in more detail later.

    def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
        """Determine the (x,y,z) locations (in the ego frame)
        of the points in the point cloud.
        Returns B x N x D x H/downsample x W/downsample x 3
        """
        B, N, _ = trans.shape

        # undo post-transformation
        # B x N x D x H x W x 3
        # data augmentation was applied to the images, so first subtract post_trans, then apply the inverse of post_rots
        points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
        points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

        # cam_to_ego
        # this is the coordinate conversion; it looks long, but it only touches the last dimension
        # points[:, :, :, :, :, :2] is x, y
        # points[:, :, :, :, :, 2:3] is d (think of it as the scale factor lambda)
        # on the last dimension this does (x, y, d) --> (x*d, y*d, d)
        points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                            points[:, :, :, :, :, 2:3]
                            ), 5)  # output (4, 6, 41, 8, 22, 3, 1)

        # 2D -> 3D transform: camera-to-ego rotation combined with the inverse intrinsics
        combine = rots.matmul(torch.inverse(intrins))  # output (4, 6, 3, 3)

        # apply the transform to get ego-frame coordinates
        # first expand combine to (B, N, 1, 1, 1, 3, 3), i.e. (4, 6, 1, 1, 1, 3, 3)
        # after the matmul(), squeeze the last dimension so points goes back to (4, 6, 41, 8, 22, 3)
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)

        # add the translation; trans.shape = (4, 6, 3)
        points += trans.view(B, N, 1, 1, 1, 3)

        return points

The final output is (B, N, D, H, W, 3) --> (4, 6, 41, 8, 22, 3).

The last dimension is X, Y, Z, i.e. the position of each (B, N, D, H, W) cell in the ego-vehicle frame (centered on the ego car).
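Written out for a single point, the chain above is ego = R · K⁻¹ · (x·d, y·d, d)ᵀ + t after undoing the augmentation. Here is a minimal sketch (my own helper, not repo code; the arguments mirror one camera's intrin/rot/tran/post_rot/post_tran from the loader):

import torch

def pixel_to_ego(u, v, d, intrin, rot, tran, post_rot, post_tran):
    # undo the image data augmentation applied during loading
    p = torch.inverse(post_rot) @ (torch.tensor([u, v, d]) - post_tran)
    # (x, y, d) -> (x*d, y*d, d): homogeneous pixel coordinates scaled by depth
    p = torch.stack((p[0] * p[2], p[1] * p[2], p[2]))
    # unproject with the inverse intrinsics, then rotate/translate into the ego frame
    return rot @ torch.inverse(intrin) @ p + tran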

get_cam_feats():

    def get_cam_feats(self, x):
        """Return B x N x D x H/downsample x W/downsample x C
        """
        # B:4, N:6, C:3, imH:128, imW:352
        B, N, C, imH, imW = x.shape

        # reshape x; B*N = 24
        x = x.view(B*N, C, imH, imW)  # output (24, 3, 128, 352)

        # the encoder is the key part
        x = self.camencode(x)  # output (24, 64, 41, 8, 22)

        # split B*N back into two separate dimensions
        x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample)

        # move C to the last dimension (why exactly, I don't know yet; noted 11.02)
        x = x.permute(0, 1, 3, 4, 5, 2)

        return x

This yields the image features, returning a (B, N, D, H, W, C) tensor, i.e. (4, 6, 41, 8, 22, 64).

The key part of this function is self.camencode(x), an instance of the CamEncode class.

Its key methods are shown below.

D is the first dimension of the frustum: with dbound = [4.0, 45.0, 1.0] the depth bins run from 4 m to 44 m at 1 m spacing, so D = 41; C = camC = 64 (also a hand-chosen hyperparameter).

    def get_depth_dist(self, x, eps=1e-20):
        return x.softmax(dim=1)

    def get_depth_feat(self, x):
        # the basic feature-extraction backbone, EfficientNet
        x = self.get_eff_depth(x)  # output (24, 512, 8, 22)

        # Depth
        # self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)
        # D and C are the key to the lift step
        x = self.depthnet(x)  # output (24, 105, 8, 22)

        # softmax over the first D channels of dim 1
        depth = self.get_depth_dist(x[:, :self.D])  # output (24, 41, 8, 22)

        # x[:, self.D:(self.D + self.C)]
        # splits dim 1 into D and C; depth covers the first D channels
        # * is element-wise multiplication for same-shaped tensors
        # if the shapes differ, the missing dimensions are broadcast, e.g. (1, 41) * (64, 1) --> (64, 41) * (64, 41)
        new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)  # output (24, 64, 41, 8, 22)

        return depth, new_x

    def forward(self, x):
        depth, x = self.get_depth_feat(x)

        return x

new_x is the result of the outer product, with shape (24, 64, 41, 8, 22), corresponding to (B*N, C, D, H, W).

It is then returned as x.

Because the downsampling factor is 16, each predicted depth distribution corresponds to a 16x16 patch of input pixels (a heavily compressed feature map), which is arguably a bit problematic (needs further analysis; just noting the issue for now).
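To make the broadcasting in the new_x line explicit, here is a toy reproduction with random tensors of the same shapes:

import torch

BN, D, C, H, W = 24, 41, 64, 8, 22
depth = torch.rand(BN, D, H, W).softmax(dim=1)   # per-pixel depth distribution
feat  = torch.rand(BN, C, H, W)                  # per-pixel context features

# (BN, 1, D, H, W) * (BN, C, 1, H, W) -> (BN, C, D, H, W)
new_x = depth.unsqueeze(1) * feat.unsqueeze(2)
print(new_x.shape)  # torch.Size([24, 64, 41, 8, 22])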


get_eff_depth():

This function first extracts features with the backbone, then upsamples and merges the feature maps, and finally returns the fused feature map.

    def get_eff_depth(self, x):
        # adapted from https://github.com/lukemelas/EfficientNet-PyTorch/blob/master/efficientnet_pytorch/model.py#L231
        endpoints = dict()  # stores feature maps taken just before the spatial size shrinks, for later upsampling and fusion

        # Stem
        # conv + batchnorm + swish on the input, going from (24, 3, 128, 352) to (24, 32, 64, 176)
        x = self.trunk._swish(self.trunk._bn0(self.trunk._conv_stem(x)))
        prev_x = x

        # Blocks
        for idx, block in enumerate(self.trunk._blocks):
            drop_connect_rate = self.trunk._global_params.drop_connect_rate
            if drop_connect_rate:
                drop_connect_rate *= float(idx) / len(self.trunk._blocks) # scale drop connect_rate
            x = block(x, drop_connect_rate=drop_connect_rate)
            if prev_x.size(2) > x.size(2):
                endpoints['reduction_{}'.format(len(endpoints)+1)] = prev_x
            prev_x = x

        # Head
        # the final output of Blocks is (24, 320, 4, 11)
        endpoints['reduction_{}'.format(len(endpoints)+1)] = x
        # 'reduction_5' == (24, 320, 4, 11)
        # 'reduction_4' == (24, 112, 8, 22)
        # upsample reduction_5 and fuse it with reduction_4 via the Up module below
        x = self.up1(endpoints['reduction_5'], endpoints['reduction_4'])  # output (24, 512, 8, 22)
        return x


class Up(nn.Module):
    def __init__(self, in_channels, out_channels, scale_factor=2):
        super().__init__()

        self.up = nn.Upsample(scale_factor=scale_factor, mode='bilinear',
                              align_corners=True)

        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x1, x2):
        x1 = self.up(x1)  # output (24, 320, 8, 22)
        x1 = torch.cat([x2, x1], dim=1)  # output (24, 432, 8, 22): 112 + 320 channels
        return self.conv(x1)  # output (24, 512, 8, 22)

voxel_pooling():

The splat step, the heart of the paper.

First, all dimensions of x except C are multiplied together and x is flattened to (Nprime, C), i.e. (173184, 64).

geom_feats is flattened the same way: (B, N, D, H, W, 3) --> (4, 6, 41, 8, 22, 3) --> (173184, 3).

    def voxel_pooling(self, geom_feats, x):
        # geom_feats is the geom passed in, i.e. the return value of get_geometry()
        # x is the return value of get_cam_feats(), i.e. (4, 6, 41, 8, 22, 64)
        B, N, D, H, W, C = x.shape
        Nprime = B*N*D*H*W

        # flatten x
        x = x.reshape(Nprime, C)  # output (173184, 64)

        # flatten indices
        # dx is the voxel size per axis and bx the center of the first voxel (from gen_dx_bx)
        # long() casts the metric coordinates to integer voxel indices
        geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()  # output (4, 6, 41, 8, 22, 3)

        geom_feats = geom_feats.view(Nprime, 3)  # output (173184, 3)

        # the batch dimension was flattened too, so build a batch index for every point
        batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                             device=x.device, dtype=torch.long) for ix in range(B)])  # output (173184, 1)

        # append the batch index as a fourth column
        geom_feats = torch.cat((geom_feats, batch_ix), 1)  # output (173184, 4)

        # filter out points that are outside box
        # nx is the number of voxels per axis; keep only points whose indices lie inside the grid on x, y and z
        kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
            & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
            & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])  # output (173184,)
        x = x[kept]  # output (168648, 64)
        geom_feats = geom_feats[kept]  # output (168648, 4)

        # get tensors from the same voxel next to each other
        # merge the features of points that land in the same voxel
        # the rank is unique per (x, y, z, batch) index, so only points in exactly the same
        # voxel of the same sample get the same rank and have their features summed later
        ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
            + geom_feats[:, 1] * (self.nx[2] * B)\
            + geom_feats[:, 2] * B\
            + geom_feats[:, 3]  # output (168648,)
        sorts = ranks.argsort()  # output (168648,)
        x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

        # cumsum trick
        # sum together the features that share the same rank
        if not self.use_quickcumsum:
            x, geom_feats = cumsum_trick(x, geom_feats, ranks)
        else:
            x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

        # griddify (B x C x Z x X x Y)
        # scatter the features into the BEV grid according to their voxel indices
        final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)  # output (4, 64, 1, 200, 200)
        final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x

        # collapse Z
        # collapse Z: there is only a single voxel along Z anyway
        # (B, C, Z, X, Y) --> (B, C, X, Y)
        final = torch.cat(final.unbind(dim=2), 1)  # output (4, 64, 200, 200)

        return final
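For reference, gen_dx_bx() returns dx (the voxel size per axis), bx (the center of the first voxel) and nx (the number of voxels per axis). A quick check of the discretization step above with the default grid (a standalone sketch; the values follow from xbound/ybound = [-50, 50, 0.5] and zbound = [-10, 10, 20]):

import torch

dx = torch.tensor([0.5, 0.5, 20.0])      # voxel size
bx = torch.tensor([-49.75, -49.75, 0.0]) # center of the first voxel
nx = torch.tensor([200, 200, 1])         # number of voxels per axis

pt = torch.tensor([0.0, 12.3, 0.0])      # an ego-frame point (x, y, z) in metres
idx = ((pt - (bx - dx / 2.)) / dx).long()
print(idx)  # tensor([100, 124,   0]) -> integer voxel indices along x, y, z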

argsort() sorts by value and returns the sorting indices; sorts stores those indices, which are then used to reorder x, geom_feats and ranks consistently.

Next, the cumsum_trick() function:

The links at the top explain this in more detail.

def cumsum_trick(x, geom_feats, ranks):
    # cumulative sum of all point features (ranks are already sorted)
    x = x.cumsum(0)
    # keep only the last point of each run of identical ranks
    kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
    kept[:-1] = (ranks[1:] != ranks[:-1])

    x, geom_feats = x[kept], geom_feats[kept]
    # differences between consecutive kept cumsums give the per-voxel feature sums
    x = torch.cat((x[:1], x[1:] - x[:-1]))

    return x, geom_feats
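A toy run of the cumsum_trick defined above shows the effect: features of points that share a rank (i.e. fall into the same voxel) are summed into a single row:

import torch

x     = torch.tensor([[1.], [2.], [4.]])                          # three point features
ranks = torch.tensor([0, 0, 5])                                   # first two share a voxel
geom  = torch.tensor([[0, 0, 0, 0], [0, 0, 0, 0], [5, 0, 0, 0]])

summed, geom_out = cumsum_trick(x, geom, ranks)
print(summed)    # tensor([[3.], [4.]]) -> 1+2 summed, 4 kept as-is
print(geom_out)  # tensor([[0, 0, 0, 0], [5, 0, 0, 0]])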

2.3 self.bevencode():

Encodes the output of self.get_voxels()? Calling it a decoder feels more appropriate.

class BevEncode(nn.Module):
    def __init__(self, inC, outC):
        super(BevEncode, self).__init__()
        # inC=64, outC=1
        trunk = resnet18(pretrained=False, zero_init_residual=True)
        self.conv1 = nn.Conv2d(inC, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = trunk.bn1
        self.relu = trunk.relu

        self.layer1 = trunk.layer1  # 64->64
        self.layer2 = trunk.layer2  # 64->128
        self.layer3 = trunk.layer3  # 128->256

        self.up1 = Up(64+256, 256, scale_factor=4)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, outC, kernel_size=1, padding=0),
        )

    def forward(self, x):
        # x={Tensor:(4, 64, 200, 200)}

        x = self.conv1(x)  # output (4, 64, 100, 100)
        x = self.bn1(x)
        x = self.relu(x)

        x1 = self.layer1(x)  # output (4, 64, 100, 100)
        x = self.layer2(x1)  # output (4, 128, 50, 50)
        x = self.layer3(x)  # output (4, 256, 25, 25)

        x = self.up1(x, x1)  # output (4, 256, 100, 100)
        x = self.up2(x)  # output (4, 1, 200, 200)

        return x

That is essentially the entire model.

The predicted output is (4, 1, 200, 200).

2.4 Post-processing after the model output

def get_val_info(model, valloader, loss_fn, device, use_tqdm=False):
    model.eval()
    total_loss = 0.0
    total_intersect = 0.0
    total_union = 0
    print('running eval...')
    loader = tqdm(valloader) if use_tqdm else valloader
    with torch.no_grad():
        for batch in loader:
            allimgs, rots, trans, intrins, post_rots, post_trans, binimgs = batch
            preds = model(allimgs.to(device), rots.to(device),
                          trans.to(device), intrins.to(device), post_rots.to(device),
                          post_trans.to(device))  # output (4, 1, 200, 200)
            binimgs = binimgs.to(device)  # (4, 1, 200, 200)

            # loss
            total_loss += loss_fn(preds, binimgs).item() * preds.shape[0]

            # iou
            intersect, union, _ = get_batch_iou(preds, binimgs)
            total_intersect += intersect
            total_union += union

    model.train()
    return {
            'loss': total_loss / len(valloader.dataset),
            'iou': total_intersect / total_union,
            }

get_batch_iou returns the intersection and union counts for a batch, which are accumulated to compute the overall IoU.

def get_batch_iou(preds, binimgs):
    """Assumes preds has NOT been sigmoided yet
    """
    with torch.no_grad():
        pred = (preds > 0)  # output (4, 1, 200, 200)
        tgt = binimgs.bool()
        intersect = (pred & tgt).sum().float().item()
        union = (pred | tgt).sum().float().item()
    return intersect, union, intersect / union if (union > 0) else 1.0
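Note that thresholding the raw logits at 0 is the same as thresholding sigmoid probabilities at 0.5, which is why the docstring insists that preds must not be sigmoided yet:

import torch

logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])
print(logits > 0)                   # tensor([False, False,  True,  True])
print(torch.sigmoid(logits) > 0.5)  # tensor([False, False,  True,  True])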

3. Reproducing the results and modifying the model

1. Reproducing the results

Train via a script, without modifying the arguments of train.py:

from src.train import train
train(version="trainval", dataroot="../data/nuScenes", gpuid=0)

Launch TensorBoard at the same time to monitor training:

tensorboard --logdir=./runs --bind_all

Training looks basically normal:

As training progresses the IoU grows steadily; after roughly 10000 iterations it reaches about 0.25.

At this point the code runs end to end.

2. Modifying the model

1. Replacing the backbone

Replace EfficientNet with ConvNeXt-tiny.

The core problem:

During training the loss looks fine, but the IoU stays at 0.

This problem has since been resolved.


Here are some notes on the details of the problems encountered:

1. The feature extractor in the LSS model mainly involves a backbone that extracts features; the output of the last stage is upsampled, concatenated with the output of the second-to-last stage, and finally the channel count is adjusted. The concrete modifications and issues for these steps will be filled in later (a rough sketch is given after this list).


2. After swapping the backbone and training the same way as in the reproduction above, the printed loss looks basically normal but the IoU stays at 0.

My first guess was a conflict between the softmax in the feature extraction and the backbone's own normalization.

Commenting out the softmax showed that this was not the cause.

After about 20 hours of training on a single RTX 3060 with batch_size=4, the IoU starts to grow at around 30000 iterations; 20 hours covers roughly 180000 iterations.


3. Ncams=5: this argument was puzzling; I don't remember whether the paper mentions randomly masking out one camera during training (the relevant code is below, and from the code it does look like that is what happens) --- src/data.py#L195-L201

    def choose_cams(self):
        if self.is_train and self.data_aug_conf['Ncams'] < len(self.data_aug_conf['cams']):
            cams = np.random.choice(self.data_aug_conf['cams'], self.data_aug_conf['Ncams'],
                                    replace=False)
        else:
            cams = self.data_aug_conf['cams']
        return cams
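As referenced in note 1 above, here is a rough sketch of what the backbone swap could look like: a ConvNeXt-tiny replacement for get_eff_depth() that mirrors the EfficientNet path (take the stride-16 and stride-32 feature maps, upsample the deeper one, concatenate, fuse to 512 channels). This is my own illustrative version assuming torchvision's convnext_tiny, not the exact code I trained with:

import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

class ConvNeXtEncoder(nn.Module):
    """Sketch of a get_eff_depth() replacement built on ConvNeXt-tiny."""
    def __init__(self, out_channels=512):
        super().__init__()
        trunk = convnext_tiny(weights=None).features
        self.to_s16 = trunk[:6]   # stem + early stages  -> stride 16, 384 channels
        self.to_s32 = trunk[6:]   # last downsample+stage -> stride 32, 768 channels
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.fuse = nn.Sequential(
            nn.Conv2d(768 + 384, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):             # x: (B*N, 3, 128, 352)
        x16 = self.to_s16(x)          # (B*N, 384, 8, 22)
        x32 = self.to_s32(x16)        # (B*N, 768, 4, 11)
        return self.fuse(torch.cat([x16, self.up(x32)], dim=1))  # (B*N, 512, 8, 22)

feats = ConvNeXtEncoder()(torch.randn(2, 3, 128, 352))
print(feats.shape)  # torch.Size([2, 512, 8, 22])

Since the fused output keeps 512 channels and stride 16, the existing depthnet (Conv2d(512, D + C, kernel_size=1)) can stay unchanged.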
