BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
Paper Quick Read
Summary:
a. Research background of this article:
This paper addresses video-based perception tasks in autonomous driving, such as 3D object detection. Existing vision-based methods are limited in their ability to predict velocity accurately, so they underperform LiDAR- and radar-based methods. The paper therefore proposes a new paradigm, BEVDet4D, which extends the BEVDet framework from a spatial-only 3D working space to a spatio-temporal 4D working space by fusing the BEV features of the previous frame with those of the current frame, giving the model access to temporal cues and simplifying velocity prediction into positional-offset prediction.
b. Past methods, their problems, and motivation:
Existing image- or video-based detection/tracking methods cannot predict object velocity accurately.
c. Research methodology proposed in this paper:
The paper proposes BEVDet4D, a video-based detection method built on the BEVDet baseline. It fuses the previous frame's BEV features with the current frame's, extending the spatial 3D working space to a spatio-temporal 4D one. A feature-alignment operation removes the effect of ego motion from the learning target, and velocity prediction is simplified into positional-offset prediction.
d. Task and performance achieved by the methods in this paper:
BEVDet4D outperforms other state-of-the-art methods, achieving higher detection mAP and lower velocity and orientation errors.
Background:
a. Subject and characteristics:
This paper studies visual perception for autonomous driving, specifically multi-camera 3D object detection.
b. Historical development:
Image- and video-based detection/tracking methods have developed steadily, yet they still cannot predict object velocity accurately.
c. Past methods:
Existing image- or video-based detection/tracking methods.
d. Past research shortcomings:
They cannot predict object velocity accurately.
e. Current issues to address:
How to improve velocity estimation in video-based 3D object detection.
Methods:
a. Theoretical basis of the study:
Deep-learning-based computer vision combined with autonomous-driving technology.
b. Technical route of the article (step by step):
This paper uses the BEVDet4D method.
First, an image-view encoder extracts features from the multi-camera images.
Then, a view transformer lifts the image features into BEV, producing the current frame's BEV features.
Next, the previous frame's BEV features are spatially aligned to the current frame to compensate for ego motion.
Finally, the aligned previous-frame features are fused with the current frame's BEV features, extending the 3D space to a 4D space.
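As a rough sketch of the fusion step above (not the paper's exact implementation; `align`, the channel counts, and the BEV resolution here are all illustrative assumptions), the temporal fusion amounts to aligning the previous frame's BEV feature and concatenating it with the current one:

```python
import torch
import torch.nn as nn

def fuse_bev_features(curr_bev: torch.Tensor,
                      prev_bev: torch.Tensor,
                      align: nn.Module) -> torch.Tensor:
    """Concatenate the (aligned) previous-frame BEV feature with the
    current one along the channel dimension."""
    prev_aligned = align(prev_bev)  # compensate ego motion (placeholder)
    return torch.cat([curr_bev, prev_aligned], dim=1)

# Toy usage: identity alignment, two 80-channel 128x128 BEV maps.
curr = torch.randn(1, 80, 128, 128)
prev = torch.randn(1, 80, 128, 128)
fused = fuse_bev_features(curr, prev, nn.Identity())
print(fused.shape)  # torch.Size([1, 160, 128, 128])
```

The concatenated feature is then consumed by the BEV encoder and detection head, which is why the channel count doubles at this point in the network.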
c. Results:
The proposed method clearly outperforms existing methods, especially in velocity estimation.
Conclusion:
a. Significance of the work:
By fusing the previous frame's features with the current frame's, BEVDet4D extends the BEVDet framework from a spatial 3D working space to a spatio-temporal 4D one, making full use of temporal cues.
b. Innovation, performance, and workload:
The most innovative aspect of BEVDet4D is that it exploits the temporal and spatial information across adjacent frames to improve object-detection accuracy in the current frame. The method not only detects well but also keeps inference latency comparable to other methods.
c. Research conclusions (list points):
BEVDet4D is a novel paradigm for autonomous-driving object detection that exploits temporal cues to improve accuracy while remaining real-time.
BEVDet4D fuses spatial and temporal information in a 4D space to predict positional offsets, extending the spatial 3D working space.
BEVDet4D clearly outperforms existing methods.
The method is practical and applicable to future autonomous-driving systems.
Converting to ONNX
(Assumes the environment has already been set up and debugged.)
First clone the source code with Git; the updated repository link is bevdet4d-one.
1. Copy the following class from mmdet3d/models/detectors/bevdet.py to below the bevdet4d class:
@DETECTORS.register_module()
class BEVDetTRT(BEVDet):

    def result_serialize(self, outs):
        outs_ = []
        for out in outs:
            for key in ['reg', 'height', 'dim', 'rot', 'vel', 'heatmap']:
                outs_.append(out[0][key])
        return outs_

    def result_deserialize(self, outs):
        outs_ = []
        keys = ['reg', 'height', 'dim', 'rot', 'vel', 'heatmap']
        for head_id in range(len(outs) // 6):
            outs_head = [dict()]
            for kid, key in enumerate(keys):
                outs_head[0][key] = outs[head_id * 6 + kid]
            outs_.append(outs_head)
        return outs_

    def forward(
        self,
        img,
        ranks_depth,
        ranks_feat,
        ranks_bev,
        interval_starts,
        interval_lengths,
    ):
        x = self.img_backbone(img)
        x = self.img_neck(x)
        x = self.img_view_transformer.depth_net(x)
        depth = x[:, :self.img_view_transformer.D].softmax(dim=1)
        tran_feat = x[:, self.img_view_transformer.D:(
            self.img_view_transformer.D +
            self.img_view_transformer.out_channels)]
        tran_feat = tran_feat.permute(0, 2, 3, 1)
        x = TRTBEVPoolv2.apply(depth.contiguous(), tran_feat.contiguous(),
                               ranks_depth, ranks_feat, ranks_bev,
                               interval_starts, interval_lengths)
        x = x.permute(0, 3, 1, 2).contiguous()
        bev_feat = self.bev_encoder(x)
        outs = self.pts_bbox_head([bev_feat])
        outs = self.result_serialize(outs)
        return outs

    def get_bev_pool_input(self, input):
        input = self.prepare_inputs(input)
        coor = self.img_view_transformer.get_lidar_coor(*input[1:7])
        return self.img_view_transformer.voxel_pooling_prepare_v2(coor)
Paste it at line 561. (Since we follow the approach step by step, the 4D-specific parts of this code will be modified again later.)
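Before moving on, the two helper methods can be sanity-checked in isolation. The snippet below is a standalone re-implementation of their logic (not the repository code) using dummy tensors, verifying that serialize followed by deserialize is a round trip:

```python
import torch

KEYS = ['reg', 'height', 'dim', 'rot', 'vel', 'heatmap']

def result_serialize(outs):
    # Flatten [[{key: tensor}], ...] into a flat list in fixed key order.
    return [out[0][key] for out in outs for key in KEYS]

def result_deserialize(flat):
    # Rebuild one dict per 6 consecutive tensors.
    return [[{key: flat[head_id * 6 + kid]
              for kid, key in enumerate(KEYS)}]
            for head_id in range(len(flat) // 6)]

# Round trip with one dummy task head.
head = [{k: torch.randn(1, 2) for k in KEYS}]
flat = result_serialize([head])
restored = result_deserialize(flat)
assert all(torch.equal(head[0][k], restored[0][0][k]) for k in KEYS)
```

Serializing to a flat tensor list matters because ONNX/TensorRT engines return plain output lists, not nested Python dicts.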
@DETECTORS.register_module()
class BEVDepth4DTRT(BEVDepth4D):

    def result_serialize(self, outs):
        outs_ = []
        for out in outs:
            for key in ['reg', 'height', 'dim', 'rot', 'vel', 'heatmap']:
                outs_.append(out[0][key])
        return outs_

    def result_deserialize(self, outs):
        outs_ = []
        keys = ['reg', 'height', 'dim', 'rot', 'vel', 'heatmap']
        for head_id in range(len(outs) // 6):
            outs_head = [dict()]
            for kid, key in enumerate(keys):
                outs_head[0][key] = outs[head_id * 6 + kid]
            outs_.append(outs_head)
        return outs_

    # Have not found where a separate pre-forward would be invoked;
    # the model jumps straight into forward.
    # def forward_pre(
    def forward(
        self,
        img,
        ranks_depth,
        ranks_feat,
        ranks_bev,
        interval_starts,
        interval_lengths,
    ):
        x = self.img_backbone(img)
        x = self.img_neck(x)
        x = self.img_view_transformer.depth_net(x)
        depth = x[:, :self.img_view_transformer.D].softmax(dim=1)
        tran_feat = x[:, self.img_view_transformer.D:(
            self.img_view_transformer.D +
            self.img_view_transformer.out_channels)]
        tran_feat = tran_feat.permute(0, 2, 3, 1)
        x = TRTBEVPoolv2.apply(depth.contiguous(), tran_feat.contiguous(),
                               ranks_depth, ranks_feat, ranks_bev,
                               interval_starts, interval_lengths)
        x = x.permute(0, 3, 1, 2).contiguous()
        bev_feat = self.bev_encoder(x)
        outs = self.pts_bbox_head([bev_feat])
        outs = self.result_serialize(outs)
        return outs

    def forward_post(
        self,
        img,
        ranks_depth,
        ranks_feat,
        ranks_bev,
        interval_starts,
        interval_lengths,
    ):
        x = self.img_backbone(img)
        x = self.img_neck(x)
        x = self.img_view_transformer.depth_net(x)
        depth = x[:, :self.img_view_transformer.D].softmax(dim=1)
        tran_feat = x[:, self.img_view_transformer.D:(
            self.img_view_transformer.D +
            self.img_view_transformer.out_channels)]
        tran_feat = tran_feat.permute(0, 2, 3, 1)
        x = TRTBEVPoolv2.apply(depth.contiguous(), tran_feat.contiguous(),
                               ranks_depth, ranks_feat, ranks_bev,
                               interval_starts, interval_lengths)
        x = x.permute(0, 3, 1, 2).contiguous()
        bev_feat = self.bev_encoder(x)
        outs = self.pts_bbox_head([bev_feat])
        outs = self.result_serialize(outs)
        return outs

    ## Will be modified again later
    def get_bev_pool_input(self, input):
        input = self.prepare_inputs(input)
        coor = self.img_view_transformer.get_lidar_coor(*input[1:7])
        return self.img_view_transformer.voxel_pooling_prepare_v2(coor)
2. Copy tools/convert_bevdet_to_TRT.py and rename it to tools/convert_bevdet4d_depth_pre.py.
Set the default argument values:
parser.add_argument('--config', default="configs/bevdet/bevdet-r50-4d-depth-cbgs.py", help='deploy config file path')
parser.add_argument('--checkpoint', default="ckpt/bevdet-r50-4d-depth-cbgs.pth", help='checkpoint file')
parser.add_argument('--work_dir', default="bevdet4d", help='work dir to save file')
The 4D model's input consists of two frames. The idea is to keep the ONNX graph single-frame: when converting, split the two-frame data into single-frame data before passing it to onnx.export, and do the stacking in BEV afterwards. Because the model now also takes depth-related input, the corresponding changes are needed as well.
for i, data in enumerate(data_loader):
    inputs = [t.cuda() for t in data['img_inputs'][0]]
    img_, metas, mlp_input = model.get_bev_pool_input(inputs)
    img = img_.squeeze(0)
    with torch.no_grad():
        torch.onnx.export(
            model,
            (img.float().contiguous(), metas[1].int().contiguous(),
             metas[2].int().contiguous(), metas[0].int().contiguous(),
             metas[3].int().contiguous(), metas[4].int().contiguous(),
             mlp_input),
            args.work_dir + model_prefix + '.onnx',
            opset_version=11,
            # 'mlp_input' is appended so that all seven exported inputs
            # have explicit names; otherwise the seventh gets an
            # auto-generated name.
            input_names=[
                'img', 'ranks_depth', 'ranks_feat', 'ranks_bev',
                'interval_starts', 'interval_lengths', 'mlp_input'
            ],
            # output_names=[f'output_{j}' for j in
            #               range(6 * len(model.pts_bbox_head.task_heads))]
            output_names=['bev_feat'])
    break
The shape of metas[0] is 179832 ≈ 256×704.
The shape of img is 1×6×3×256×704.
Therefore, continue by modifying get_bev_pool_input so that it takes single-frame input and additionally returns the MLP input.
At this step, `img_, metas, mlp_input = model.get_bev_pool_input(inputs)` jumps into the following function:
def get_bev_pool_input(self, input):
    input = self.prepare_inputs(input)
    # coor.shape = (1, 6, 59, 16, 44, 3)
    # List comprehension + indexing: grab the first (current-frame)
    # element of each sub-tuple in one pass.
    arr = [sub_t[0] for sub_t in input[0:7]]
    coor = self.img_view_transformer.get_lidar_coor_4d(*arr[1:7])
    mlp_input = self.img_view_transformer.get_mlp_input(*arr[1:7])
    return input[0][0], self.img_view_transformer.voxel_pooling_prepare_v2(coor), mlp_input
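The list comprehension above can be checked in isolation. The sketch below uses dummy tensors with made-up shapes purely to show that `sub_t[0]` picks the current-frame element out of each two-frame pair:

```python
import torch

# Dummy stand-ins for prepare_inputs output: each entry is a pair
# (current_frame_tensor, previous_frame_tensor). Shapes are made up.
inputs = [(torch.randn(1, 6, 3, 3), torch.randn(1, 6, 3, 3))
          for _ in range(7)]

# Same pattern as get_bev_pool_input: keep only the current frame.
arr = [sub_t[0] for sub_t in inputs[0:7]]

assert len(arr) == 7
assert arr[0].shape == torch.Size([1, 6, 3, 3])
```

This keeps the exported graph single-frame while the second frame's data stays available in Python for the later BEV-level fusion.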
def get_lidar_coor_4d(self, sensor2ego, ego2global, cam2imgs, post_rots,
                      post_trans, bda):
    """Calculate the locations of the frustum points in the lidar
    coordinate system.

    Args:
        sensor2ego (torch.Tensor): Transform from camera coordinate
            system to ego in shape (B, N_cams, 4, 4).
        ego2global (torch.Tensor): Transform from ego to global in
            shape (B, N_cams, 4, 4).
        cam2imgs (torch.Tensor): Camera intrinsic matrices in shape
            (B, N_cams, 3, 3).
        post_rots (torch.Tensor): Rotation in camera coordinate system
            in shape (B, N_cams, 3, 3), derived from image-view
            augmentation.
        post_trans (torch.Tensor): Translation in camera coordinate
            system derived from image-view augmentation in shape
            (B, N_cams, 3).

    Returns:
        torch.Tensor: Point coordinates in shape
            (B, N_cams, D, H_downsample, W_downsample, 3).
    """
    B, N, _, _ = sensor2ego.shape
    # post-transformation
    # B x N x D x H x W x 3
    # Slice with step 2 to take every other depth plane of the frustum.
    points = (self.frustum[::2, :, :, :].reshape(59, 16, 44, 3)
              ).to(sensor2ego) - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3)\
        .matmul(points.unsqueeze(-1))
    # cam_to_ego
    points = torch.cat(
        (points[..., :2, :] * points[..., 2:3, :], points[..., 2:3, :]), 5)
    combine = sensor2ego[:, :, :3, :3].matmul(torch.inverse(cam2imgs))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += sensor2ego[:, :, :3, 3].view(B, N, 1, 1, 1, 3)
    points = bda.view(B, 1, 1, 1, 1, 3,
                      3).matmul(points.unsqueeze(-1)).squeeze(-1)
    return points
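The `[::2]` slice takes every other depth plane of the stored frustum. Assuming the full frustum holds 118 depth planes (so the slice leaves the 59 seen in the reshape above; the 118 is an inference from the target shape, not stated in the source), the slicing behaviour is easy to check:

```python
import torch

# Illustrative frustum with 118 depth planes; [::2] keeps 59 of them.
frustum = torch.randn(118, 16, 44, 3)
points = frustum[::2, :, :, :]   # every other depth plane
print(points.shape)              # torch.Size([59, 16, 44, 3])
```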
The shapes observed at each step while debugging voxel_pooling_prepare_v2 are annotated below:
def voxel_pooling_prepare_v2(self, coor):
    """Data preparation for voxel pooling.

    Args:
        coor (torch.Tensor): Coordinates of points in the lidar space
            in shape (B, N, D, H, W, 3).

    Returns:
        tuple[torch.Tensor]: Rank of the voxel each point belongs to,
            in shape (N_points,); reserved index of points in the
            depth space, in shape (N_points,); reserved index of
            points in the feature space, in shape (N_points,).
    """
    B, N, D, H, W, _ = coor.shape
    # num_points = 249216
    num_points = B * N * D * H * W
    # Record the index of selected points for acceleration.
    # ranks_depth: 249216
    ranks_depth = torch.range(
        0, num_points - 1, dtype=torch.int, device=coor.device)
    ranks_feat = torch.range(
        0, num_points // D - 1, dtype=torch.int, device=coor.device)
    ranks_feat = ranks_feat.reshape(B, N, 1, H, W)
    ranks_feat = ranks_feat.expand(B, N, D, H, W).flatten()
    # Convert coordinates into the voxel space.
    # coor.shape: torch.Size([1, 6, 59, 16, 44, 3])
    coor = ((coor - self.grid_lower_bound.to(coor)) /
            self.grid_interval.to(coor))
    coor = coor.long().view(num_points, 3)  # torch.Size([249216, 3])
    batch_idx = torch.range(0, B - 1).reshape(B, 1). \
        expand(B, num_points // B).reshape(num_points, 1).to(coor)
    # batch_idx.shape: torch.Size([249216, 1])
    coor = torch.cat((coor, batch_idx), 1)  # torch.Size([249216, 4])
    # Filter out points outside the grid; kept.shape: torch.Size([249216])
    kept = (coor[:, 0] >= 0) & (coor[:, 0] < self.grid_size[0]) & \
           (coor[:, 1] >= 0) & (coor[:, 1] < self.grid_size[1]) & \
           (coor[:, 2] >= 0) & (coor[:, 2] < self.grid_size[2])
    if len(kept) == 0:
        return None, None, None, None, None
    coor, ranks_depth, ranks_feat = \
        coor[kept], ranks_depth[kept], ranks_feat[kept]
    # Put tensors from the same voxel next to each other.
    # After filtering: shape torch.Size([179832])
    # self.grid_size[2] * self.grid_size[1] * self.grid_size[0] = 16384
    # coor[:, 3].shape = 179832
    ranks_bev = coor[:, 3] * (
        self.grid_size[2] * self.grid_size[1] * self.grid_size[0])
    ranks_bev += coor[:, 2] * (self.grid_size[1] * self.grid_size[0])
    ranks_bev += coor[:, 1] * self.grid_size[0] + coor[:, 0]
    order = ranks_bev.argsort()
    ranks_bev, ranks_depth, ranks_feat = \
        ranks_bev[order], ranks_depth[order], ranks_feat[order]
    kept = torch.ones(
        ranks_bev.shape[0], device=ranks_bev.device, dtype=torch.bool)
    kept[1:] = ranks_bev[1:] != ranks_bev[:-1]
    interval_starts = torch.where(kept)[0].int()
    if len(interval_starts) == 0:
        return None, None, None, None, None
    interval_lengths = torch.zeros_like(interval_starts)
    interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
    interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]
    return ranks_bev.int().contiguous(), ranks_depth.int().contiguous(
    ), ranks_feat.int().contiguous(), interval_starts.int().contiguous(
    ), interval_lengths.int().contiguous()
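The interval bookkeeping at the end of voxel_pooling_prepare_v2 can be verified on a toy input: after sorting points by BEV rank, interval_starts marks each position where the rank changes and interval_lengths counts the points per voxel:

```python
import torch

# Six points falling into three BEV voxels, already sorted by rank.
ranks_bev = torch.tensor([0, 0, 3, 3, 3, 7])

# A point starts a new interval where its rank differs from the previous.
kept = torch.ones(ranks_bev.shape[0], dtype=torch.bool)
kept[1:] = ranks_bev[1:] != ranks_bev[:-1]
interval_starts = torch.where(kept)[0].int()

# Length of each interval = distance to the next start (or to the end).
interval_lengths = torch.zeros_like(interval_starts)
interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]

print(interval_starts.tolist())   # [0, 2, 5]
print(interval_lengths.tolist())  # [2, 3, 1]
```

These two arrays are exactly what TRTBEVPoolv2 consumes to sum each voxel's points without a scatter operation.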
With that, the first part is essentially done. Open Netron to inspect the node names and shapes of the exported graph.
3. Unresolved issues
1. The model has not yet been converted to INT8.