First, links to other people's write-ups:
https://zhuanlan.zhihu.com/p/567880155
https://blog.csdn.net/weixin_41803339/article/details/127140039?spm=1001.2014.3001.5502
And a video walkthrough:
https://www.bilibili.com/video/av470798830/?vd_source=ff498e5dc05e7bbe6be82c1d9e17f9fa
1. Debugging with the eval_model_iou function in explore.py; a note on how the arguments are passed:
python main.py eval_model_iou mini/trainval --modelf=MODEL_LOCATION --dataroot=NUSCENES_ROOT
For version choose mini or trainval, and also supply the path to the downloaded pretrained model and the path to the whole nuscenes folder.
The version is resolved under dataroot; the folder layout looks like this:
nuscenes
|---mini
| |---maps
| |---samples
| |---sweeps
| |---v1.0-mini
|---trainval
| |---maps
| |---samples
| |---sweeps
| |---v1.0-trainval
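As far as I can tell from this layout, the loader joins dataroot and version roughly like this (a sketch, the real call lives in src/data.py):
import os
from nuscenes.nuscenes import NuScenes

# Sketch: dataroot/<version> must contain maps/, samples/, sweeps/ and v1.0-<version>/
version, dataroot = "mini", "../data/nuScenes"
nusc = NuScenes(version=f"v1.0-{version}",
                dataroot=os.path.join(dataroot, version),
                verbose=False)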
2. Walking through the algorithm flow:
Debug by calling:
eval_model_iou(version="mini", modelf="./model525000.pt", dataroot="../data/nuScenes", gpuid=0)
def eval_model_iou(version,
modelf,
dataroot='/data/nuscenes',
gpuid=1,
H=900, W=1600,
resize_lim=(0.193, 0.225),
final_dim=(128, 352),
bot_pct_lim=(0.0, 0.22),
rot_lim=(-5.4, 5.4),
rand_flip=True,
xbound=[-50.0, 50.0, 0.5],
ybound=[-50.0, 50.0, 0.5],
zbound=[-10.0, 10.0, 20.0],
dbound=[4.0, 45.0, 1.0], # this determines D = 41 below
bsz=4,
nworkers=10,
):
grid_conf = {
'xbound': xbound,
'ybound': ybound,
'zbound': zbound,
'dbound': dbound,
}
data_aug_conf = {
'resize_lim': resize_lim,
'final_dim': final_dim,
'rot_lim': rot_lim,
'H': H, 'W': W,
'rand_flip': rand_flip,
'bot_pct_lim': bot_pct_lim,
'cams': ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'],
'Ncams': 5, # question: why 5 and not 6? (see the note on choose_cams at the end)
}
trainloader, valloader = compile_data(version, dataroot, data_aug_conf=data_aug_conf,
grid_conf=grid_conf, bsz=bsz, nworkers=nworkers,
parser_name='segmentationdata')
device = torch.device('cpu') if gpuid < 0 else torch.device(f'cuda:{gpuid}')
model = compile_model(grid_conf, data_aug_conf, outC=1)
print('loading', modelf)
model.load_state_dict(torch.load(modelf))
model.to(device)
loss_fn = SimpleLoss(1.0).cuda(gpuid)
model.eval()
val_info = get_val_info(model, valloader, loss_fn, device)
print(val_info)
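For context, SimpleLoss is essentially a BCEWithLogitsLoss wrapper; a minimal sketch from memory of src/tools.py (the real file is authoritative). This matters later, because it means the model output is treated as raw logits everywhere:
import torch

class SimpleLoss(torch.nn.Module):
    def __init__(self, pos_weight):
        super(SimpleLoss, self).__init__()
        # pos_weight re-weights the positive (vehicle) pixels in the BCE loss
        self.loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([pos_weight]))

    def forward(self, ypred, ytgt):
        return self.loss_fn(ypred, ytgt)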
Before going through the key methods, first the overall model structure:
class LiftSplatShoot(nn.Module):
def __init__(self, grid_conf, data_aug_conf, outC):
super(LiftSplatShoot, self).__init__()
self.grid_conf = grid_conf
self.data_aug_conf = data_aug_conf
dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
self.grid_conf['ybound'],
self.grid_conf['zbound'],
)
self.dx = nn.Parameter(dx, requires_grad=False)
self.bx = nn.Parameter(bx, requires_grad=False)
self.nx = nn.Parameter(nx, requires_grad=False)
self.downsample = 16
self.camC = 64
self.frustum = self.create_frustum()
self.D, _, _, _ = self.frustum.shape # this D is the 41 used below
self.camencode = CamEncode(self.D, self.camC, self.downsample)
self.bevencode = BevEncode(inC=self.camC, outC=outC)
# toggle using QuickCumsum vs. autograd
self.use_quickcumsum = True
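dx, bx and nx come from gen_dx_bx in src/tools.py: dx is the voxel size along each axis, bx is the center of the first voxel, and nx is the number of voxels. A minimal sketch of what it computes (written from memory, values checked against the bounds above):
import torch

def gen_dx_bx(xbound, ybound, zbound):
    # each bound is [min, max, step]
    dx = torch.Tensor([row[2] for row in [xbound, ybound, zbound]])                # [0.5, 0.5, 20.0]
    bx = torch.Tensor([row[0] + row[2] / 2.0 for row in [xbound, ybound, zbound]]) # [-49.75, -49.75, 0.0]
    nx = torch.LongTensor([(row[1] - row[0]) / row[2]
                           for row in [xbound, ybound, zbound]])                   # [200, 200, 1]
    return dx, bx, nx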
2.1 The frustum-creation function
First, the function that creates the frustum:
def create_frustum(self):
# make grid in image plane
# ['final_dim'] = (128, 352)
ogfH, ogfW = self.data_aug_conf['final_dim']
# self.downsample = 16, fH:8, fW:22
fH, fW = ogfH // self.downsample, ogfW // self.downsample
ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW) # shape (41, 8, 22)
# D:41
D, _, _ = ds.shape
xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW) # shape (41, 8, 22)
ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW) # shape (41, 8, 22)
# D x H x W x 3
frustum = torch.stack((xs, ys, ds), -1) # shape (41, 8, 22, 3)
return nn.Parameter(frustum, requires_grad=False)
torch.linspace() returns a one-dimensional tensor.
torch.stack() stacks xs, ys, ds along a new last dimension: three (41, 8, 22) tensors -> (41, 8, 22, 3).
The last dimension holds (x, y, d): for each (D, H, W) cell, x and y are pixel coordinates in the (augmented) input image and d is the candidate depth in meters, so the frustum is defined per camera in image-plane-plus-depth coordinates.
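A quick sanity check of these numbers, using only values from the config above:
import torch

dbound = [4.0, 45.0, 1.0]
ds = torch.arange(*dbound)              # depths 4, 5, ..., 44
print(ds.numel())                       # 41  -> D
final_dim, downsample = (128, 352), 16
fH, fW = final_dim[0] // downsample, final_dim[1] // downsample
print(fH, fW)                           # 8 22
# so frustum.shape == (41, 8, 22, 3): one (x, y, d) triple per depth bin and feature-map cell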
2.2 Getting the voxels centered on the ego vehicle
get_voxels():
Produces the final BEV voxel features, which then go into bevencode (bevencode feels more like a decoder/head here, since its output has outC=1).
get_voxels() calls three important functions:
def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
# all of these arguments come straight from the data loader
geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans) # shape (4, 6, 41, 8, 22, 3)
x = self.get_cam_feats(x) # shape (4, 6, 41, 8, 22, 64)
x = self.voxel_pooling(geom, x)
return x
get_geometry():
Computes the 3D location of every point in the "feature point cloud" produced by get_cam_feats (many write-ups describe this as an index table, which I found a bit confusing at first).
It also has to undo the image data augmentation first; worth studying in more detail later.
def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
"""Determine the (x,y,z) locations (in the ego frame)
of the points in the point cloud.
Returns B x N x D x H/downsample x W/downsample x 3
"""
B, N, _ = trans.shape
# undo post-transformation
# B x N x D x H x W x 3
# the images were augmented, so first subtract post_trans and then apply the inverse of post_rots to undo the augmentation
points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))
# cam_to_ego
# coordinate conversion; it looks long but only touches the last dimension
# points[..., :2] are the pixel coordinates x, y
# points[..., 2:3] is the depth d, which can also be read as the scale factor lambda
# so on the last dimension: (x, y, d) -> (x*d, y*d, d)
points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
points[:, :, :, :, :, 2:3]
), 5) # shape (4, 6, 41, 8, 22, 3, 1)
# 2D -> 3D transform: camera-to-ego rotation times the inverse intrinsics
combine = rots.matmul(torch.inverse(intrins)) # shape (4, 6, 3, 3)
# apply the transform to every frustum point:
# broadcast combine to (B, N, 1, 1, 1, 3, 3), i.e. (4, 6, 1, 1, 1, 3, 3),
# matmul it with points (..., 3, 1),
# then squeeze the trailing 1 so points is back to (4, 6, 41, 8, 22, 3)
points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
# add the camera-to-ego translation, trans.shape = (4, 6, 3)
points += trans.view(B, N, 1, 1, 1, 3)
return points
The final output is (B, N, D, H, W, 3) -> (4, 6, 41, 8, 22, 3).
The last dimension holds X, Y, Z: the location of each (B, N, D, H, W) frustum point in the ego-vehicle frame (centered on the car).
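To make the per-point math concrete, here is a tiny one-point version of the same chain of operations (toy intrinsics/extrinsics, and the augmentation undo is skipped, i.e. post_rots/post_trans are taken as identity):
import torch

u, v, d = 176.0, 64.0, 10.0                       # pixel coords + candidate depth
intrin = torch.tensor([[800.0, 0.0, 176.0],
                       [0.0, 800.0, 64.0],
                       [0.0, 0.0, 1.0]])
rot = torch.eye(3)                                # camera-to-ego rotation (toy value)
tran = torch.tensor([1.5, 0.0, 1.6])              # camera position in the ego frame (toy value)

p = torch.tensor([u * d, v * d, d])               # (x, y, d) -> (x*d, y*d, d)
p_cam = torch.inverse(intrin) @ p                 # inverse intrinsics: pixel ray scaled by depth
p_ego = rot @ p_cam + tran                        # cam_to_ego
print(p_ego)                                      # tensor([ 1.5000,  0.0000, 11.6000])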
get_cam_feats():
def get_cam_feats(self, x):
"""Return B x N x D x H/downsample x W/downsample x C
"""
# B:4, N:6, C:3, imH:128, imW:352
B, N, C, imH, imW = x.shape
# flatten the batch and camera dimensions, B*N = 24
x = x.view(B*N, C, imH, imW) # shape (24, 3, 128, 352)
# the important part: the camera encoder
x = self.camencode(x) # shape (24, 64, 41, 8, 22)
# split B*N back into two dimensions
x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample)
# move C to the last dimension (I wasn't sure why at first, 11.02; voxel_pooling later flattens everything except C into (Nprime, C), which needs C last)
x = x.permute(0, 1, 3, 4, 5, 2)
return x
This yields the image features, returned as (B, N, D, H, W, C), i.e. (4, 6, 41, 8, 22, 64).
The heart of this function is self.camencode(x), an instance of the CamEncode class.
Its key methods are as follows:
D is the first dimension of the frustum: depths from 4 m to 44 m discretized at 1 m intervals (dbound = [4.0, 45.0, 1.0]), so D = 41; C = camC = 64 is a hand-picked feature dimension.
def get_depth_dist(self, x, eps=1e-20):
return x.softmax(dim=1)
def get_depth_feat(self, x):
# backbone feature extraction, using EfficientNet
x = self.get_eff_depth(x) # shape (24, 512, 8, 22)
# Depth
# self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)
# D and C together are the key to the 'lift' step
x = self.depthnet(x) # shape (24, 105, 8, 22)
# softmax over the first D channels: the per-pixel depth distribution
depth = self.get_depth_dist(x[:, :self.D]) # shape (24, 41, 8, 22)
# x[:, self.D:(self.D + self.C)]
# the channel dimension is split into D (depth) and C (context); depth is the first D channels
# * is element-wise multiplication for same-shaped tensors
# mismatched shapes are broadcast, e.g. (1, 41) * (64, 1) -> (64, 41) * (64, 41)
new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2) # shape (24, 64, 41, 8, 22)
return depth, new_x
def forward(self, x):
depth, x = self.get_depth_feat(x)
return x
new_x is the result of the outer product, with shape (24, 64, 41, 8, 22), corresponding to (B*N, C, D, H, W).
It is what forward() finally returns as x.
Since the downsampling factor is 16, each predicted depth distribution is shared by a 16x16 patch of the original image (the feature map is heavily compressed spatially); this seems a bit questionable (needs further analysis, just noting the issue for now).
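A stripped-down view of the outer product in get_depth_feat, just to see how the broadcasting produces (B*N, C, D, H, W) (random input, shapes only):
import torch

BN, D, C, H, W = 24, 41, 64, 8, 22
feat = torch.randn(BN, D + C, H, W)        # output of depthnet, (24, 105, 8, 22)

depth = feat[:, :D].softmax(dim=1)         # (24, 41, 8, 22): per-pixel depth distribution
context = feat[:, D:D + C]                 # (24, 64, 8, 22): image context features

# (24, 1, 41, 8, 22) * (24, 64, 1, 8, 22) -> broadcast -> (24, 64, 41, 8, 22)
new_x = depth.unsqueeze(1) * context.unsqueeze(2)
print(new_x.shape)                         # torch.Size([24, 64, 41, 8, 22])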
get_eff_depth():
This function runs the backbone, upsamples the last feature map and fuses it with the second-to-last one, and returns the fused feature map.
def get_eff_depth(self, x):
# adapted from https://github.com/lukemelas/EfficientNet-PyTorch/blob/master/efficientnet_pytorch/model.py#L231
endpoints = dict() # stores the feature map right before each spatial downsampling, for the upsampling/fusion step later
# Stem
# conv + BN + Swish stem on the input images: (24, 3, 128, 352) -> (24, 32, 64, 176)
x = self.trunk._swish(self.trunk._bn0(self.trunk._conv_stem(x)))
prev_x = x
# Blocks
for idx, block in enumerate(self.trunk._blocks):
drop_connect_rate = self.trunk._global_params.drop_connect_rate
if drop_connect_rate:
drop_connect_rate *= float(idx) / len(self.trunk._blocks) # scale drop connect_rate
x = block(x, drop_connect_rate=drop_connect_rate)
if prev_x.size(2) > x.size(2):
endpoints['reduction_{}'.format(len(endpoints)+1)] = prev_x
prev_x = x
# Head
# after the last block, x is (24, 320, 4, 11)
endpoints['reduction_{}'.format(len(endpoints)+1)] = x
# 'reduction_5' == (24, 320, 4, 11)
# 'reduction_4' == (24, 112, 8, 22)
# up1 upsamples reduction_5 to (24, 320, 8, 22), concatenates it with reduction_4, and fuses them with two convs
x = self.up1(endpoints['reduction_5'], endpoints['reduction_4']) # shape (24, 512, 8, 22)
return x
class Up(nn.Module):
def __init__(self, in_channels, out_channels, scale_factor=2):
super().__init__()
self.up = nn.Upsample(scale_factor=scale_factor, mode='bilinear',
align_corners=True)
self.conv = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True)
)
def forward(self, x1, x2):
x1 = self.up(x1) # shape (24, 320, 8, 22)
x1 = torch.cat([x2, x1], dim=1) # shape (24, 432, 8, 22)
return self.conv(x1) # shape (24, 512, 8, 22)
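A small usage sketch with random tensors; the Up(320 + 112, 512) instantiation is how I recall CamEncode setting up up1, so treat it as an assumption:
import torch

up1 = Up(320 + 112, 512)                    # Up is the class defined just above
reduction_5 = torch.randn(24, 320, 4, 11)
reduction_4 = torch.randn(24, 112, 8, 22)
out = up1(reduction_5, reduction_4)
print(out.shape)                            # torch.Size([24, 512, 8, 22])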
voxel_pooling():
The splat step, the core trick of the paper.
All dimensions except C are multiplied together and x is flattened to (Nprime, C) = (173184, 64).
geom_feats is flattened the same way: (B, N, D, H, W, 3) -> (4, 6, 41, 8, 22, 3) -> (173184, 3).
def voxel_pooling(self, geom_feats, x):
# geom_feats is the geom passed in, i.e. the return value of get_geometry()
# x is the return value of get_cam_feats(), i.e. (4, 6, 41, 8, 22, 64)
B, N, D, H, W, C = x.shape
Nprime = B*N*D*H*W
# flatten x
x = x.reshape(Nprime, C) # shape (173184, 64)
# flatten indices
# bx is the center of the first voxel and dx is the voxel size (from gen_dx_bx),
# so this converts ego-frame coordinates into voxel indices; long() casts them to integers for indexing
geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long() # shape (4, 6, 41, 8, 22, 3)
geom_feats = geom_feats.view(Nprime, 3) # shape (173184, 3)
# the batch dimension was flattened too, so build an explicit batch index
batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
device=x.device, dtype=torch.long) for ix in range(B)]) # shape (173184, 1)
# and append it as a fourth column
geom_feats = torch.cat((geom_feats, batch_ix), 1) # shape (173184, 4)
# filter out points that are outside box
# nx = [200, 200, 1] is the number of voxels along x/y/z; keep only points whose voxel indices lie inside the grid on all three axes
kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
& (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
& (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2]) # shape (173184,)
x = x[kept] # shape (168648, 64)
geom_feats = geom_feats[kept] # shape (168648, 4)
# get tensors from the same voxel next to each other
# merge features that land in the same voxel
# ranks is a unique scalar id per (x, y, z, batch) cell: only points in exactly the same voxel of the same sample get the same rank, so only their features get summed together later
ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
+ geom_feats[:, 1] * (self.nx[2] * B)\
+ geom_feats[:, 2] * B\
+ geom_feats[:, 3] # shape (168648,)
sorts = ranks.argsort() # shape (168648,)
x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]
# cumsum trick
# sum features that share the same rank (i.e. fall into the same voxel)
if not self.use_quickcumsum:
x, geom_feats = cumsum_trick(x, geom_feats, ranks)
else:
x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)
# griddify (B x C x Z x X x Y)
# scatter the pooled features into a dense BEV grid using the voxel indices
final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device) # shape (4, 64, 1, 200, 200)
final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x
# collapse Z
# collapse the Z axis; with zbound = [-10, 10, 20] there is only a single voxel along Z
# (B, C, Z, X, Y) -> (B, C*Z, X, Y); here Z = 1, so effectively (B, C, X, Y)
final = torch.cat(final.unbind(dim=2), 1) # shape (4, 64, 200, 200)
return final
argsort() sorts the ranks and returns the sorting indices; sorts is then used to reorder x, geom_feats and ranks so that points belonging to the same voxel end up next to each other.
Next, the cumsum_trick() function:
The write-ups linked at the top explain this step in more detail.
def cumsum_trick(x, geom_feats, ranks):
x = x.cumsum(0) # running sum over all points (already sorted by rank)
kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
kept[:-1] = (ranks[1:] != ranks[:-1]) # keep only the last point of each rank group
x, geom_feats = x[kept], geom_feats[kept]
x = torch.cat((x[:1], x[1:] - x[:-1])) # difference of neighbouring cumsums = per-voxel sum
return x, geom_feats
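A tiny hand-made example of what the trick computes (using the cumsum_trick defined above): three points fall into the voxel with rank 0 and two into rank 1, already sorted by rank:
import torch

x = torch.tensor([[1.], [2.], [3.], [10.], [20.]])
ranks = torch.tensor([0, 0, 0, 1, 1])
geom_feats = torch.zeros(5, 4, dtype=torch.long)   # dummy (x, y, z, batch) indices

pooled, _ = cumsum_trick(x, geom_feats, ranks)
print(pooled.squeeze())   # tensor([ 6., 30.])  ->  1+2+3 and 10+20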
2.3 self.bevencode():
Encodes the output of self.get_voxels() (though calling it a decoder feels more apt, since it is the output head).
class BevEncode(nn.Module):
def __init__(self, inC, outC):
super(BevEncode, self).__init__()
# inC=64, outC=1
trunk = resnet18(pretrained=False, zero_init_residual=True)
self.conv1 = nn.Conv2d(inC, 64, kernel_size=7, stride=2, padding=3,
bias=False)
self.bn1 = trunk.bn1
self.relu = trunk.relu
self.layer1 = trunk.layer1 # 64->64
self.layer2 = trunk.layer2 # 64->128
self.layer3 = trunk.layer3 # 128->256
self.up1 = Up(64+256, 256, scale_factor=4)
self.up2 = nn.Sequential(
nn.Upsample(scale_factor=2, mode='bilinear',
align_corners=True),
nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
nn.Conv2d(128, outC, kernel_size=1, padding=0),
)
def forward(self, x):
# x={Tensor:(4, 64, 200, 200)}
x = self.conv1(x) # shape (4, 64, 100, 100)
x = self.bn1(x)
x = self.relu(x)
x1 = self.layer1(x) # shape (4, 64, 100, 100)
x = self.layer2(x1) # shape (4, 128, 50, 50)
x = self.layer3(x) # shape (4, 256, 25, 25)
x = self.up1(x, x1) # shape (4, 256, 100, 100)
x = self.up2(x) # shape (4, 1, 200, 200)
return x
That is essentially the whole model.
The output prediction has shape (4, 1, 200, 200).
2.4 Post-processing after the model output
def get_val_info(model, valloader, loss_fn, device, use_tqdm=False):
model.eval()
total_loss = 0.0
total_intersect = 0.0
total_union = 0
print('running eval...')
loader = tqdm(valloader) if use_tqdm else valloader
with torch.no_grad():
for batch in loader:
allimgs, rots, trans, intrins, post_rots, post_trans, binimgs = batch
preds = model(allimgs.to(device), rots.to(device),
trans.to(device), intrins.to(device), post_rots.to(device),
post_trans.to(device)) # shape (4, 1, 200, 200)
binimgs = binimgs.to(device) # shape (4, 1, 200, 200)
# loss
total_loss += loss_fn(preds, binimgs).item() * preds.shape[0]
# iou
intersect, union, _ = get_batch_iou(preds, binimgs)
total_intersect += intersect
total_union += union
model.train()
return {
'loss': total_loss / len(valloader.dataset),
'iou': total_intersect / total_union,
}
get_batch_iou returns the per-batch intersection and union, which are accumulated to compute the overall IoU:
def get_batch_iou(preds, binimgs):
"""Assumes preds has NOT been sigmoided yet
"""
with torch.no_grad():
pred = (preds > 0) # shape (4, 1, 200, 200)
tgt = binimgs.bool()
intersect = (pred & tgt).sum().float().item()
union = (pred | tgt).sum().float().item()
return intersect, union, intersect / union if (union > 0) else 1.0
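Since preds are raw logits (the docstring says they have not been sigmoided), preds > 0 corresponds to a probability threshold of 0.5. A tiny sanity check of the same computation with hand-made values:
import torch

preds = torch.tensor([[0.7, -1.2], [2.0, -0.3]])   # logits, NOT sigmoided
binimgs = torch.tensor([[1.0, 0.0], [1.0, 1.0]])   # binary ground truth

pred = preds > 0                                   # same as sigmoid(preds) > 0.5
tgt = binimgs.bool()
intersect = (pred & tgt).sum().float().item()      # 2.0
union = (pred | tgt).sum().float().item()          # 3.0
print(intersect / union)                           # 0.666...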
3. Reproducing the results and modifying the model
1. Reproducing the results
Train with the following script, without changing the arguments of train.py:
from src.train import train
train(version="trainval", dataroot="../data/nuScenes", gpuid=0)
At the same time, launch tensorboard to monitor training:
tensorboard --logdir=./runs --bind_all
Training looks basically normal:
as training progresses the IoU keeps increasing; after roughly 10000 iterations it reaches about 0.25.
At this point the code runs end to end.
2. Modifying the model
1. Replacing the backbone
Replace EfficientNet with ConvNeXt-tiny.
Core problem:
during training the loss is produced normally, but the IoU stays at 0.
This problem has since been solved.
Notes on the issues encountered, for the record:
1. The feature extractor in LSS boils down to: a backbone extracts features, the output of the last stage is upsampled and concatenated with the output of the second-to-last stage, and the channels are then adjusted. The concrete changes and pitfalls of these steps will be filled in later.
2. After swapping the backbone and training with the same recipe as the reproduction above, the printed loss looked normal but the IoU stayed at 0.
My first guess was a conflict between the softmax in the depth head and the normalization inside the backbone.
Commenting out the softmax showed that this was not the cause.
After about 20 hours of training on a single 3060 with batch_size=4, the IoU finally started to grow at around 30000 iterations; the 20 hours amounted to roughly 180000 iterations in total.
3. Ncams=5: I had doubts about this argument; I don't remember whether the paper mentions randomly masking out one camera during training, but the code below (src/data.py#L195-L201) suggests that this is exactly what happens.
def choose_cams(self):
if self.is_train and self.data_aug_conf['Ncams'] < len(self.data_aug_conf['cams']):
cams = np.random.choice(self.data_aug_conf['cams'], self.data_aug_conf['Ncams'],
replace=False)
else:
cams = self.data_aug_conf['cams']
return cams
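So during training one of the six cameras is randomly dropped for each sample (Ncams=5 < 6), while evaluation uses all six. A standalone illustration of the sampling behaviour:
import numpy as np

cams = ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
        'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT']
# training: randomly keep 5 of the 6 cameras (no repeats), i.e. drop one at random
print(np.random.choice(cams, 5, replace=False))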