VoxelNet
Reference blog: https://blog.csdn.net/yanghao201607030101/article/details/114708548?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_title~default-0.control&spm=1001.2101.3001.4242
Background
Most approaches at the time focused on hand-crafted features (see the reference blog), but such hand-designed features create an information bottleneck and cannot fully exploit the information contained in 3D data.
Problems addressed
- When rich and detailed 3D shape information is available, traditional hand-crafted features can achieve satisfactory results, but they fail on complex scenes and shapes. The paper therefore proposes an end-to-end deep framework that operates directly on sparse 3D point clouds, avoiding the information bottleneck.
- The network not only handles the sparse point structure efficiently, but also benefits from efficient parallel processing on the voxel grid.
Network architecture
- Feature Learning Network
Voxel Partition
The given 3D space is subdivided into equally spaced voxels. Suppose the point cloud spans $D$, $H$ and $W$ along the Z, Y and X axes, and each voxel has size $v_D, v_H, v_W$; the grid then has $D' = D/v_D$, $H' = H/v_H$ and $W' = W/v_W$ voxels along the corresponding axes.
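As a quick sanity check of these formulas, here is a minimal sketch (`grid_shape` is a hypothetical helper, not from the reference code; the ranges and voxel sizes are the car-detection values given under Training details below):

```python
# Compute the voxel grid shape (D', H', W') = (D/v_D, H/v_H, W/v_W).
def grid_shape(extent, voxel_size):
    d, h, w = extent          # physical extent along Z, Y, X
    vd, vh, vw = voxel_size   # voxel size along Z, Y, X
    # round to absorb floating-point error in the division
    return (int(round(d / vd)), int(round(h / vh)), int(round(w / vw)))

# car setting: 4 m x 80 m x 70.4 m range, 0.4 m x 0.2 m x 0.2 m voxels
print(grid_shape((4, 80, 70.4), (0.4, 0.2, 0.2)))  # (10, 400, 352)
```

These are exactly the grid dimensions (10, 400, 352) that the middle layers and RPN below assume for cars.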
Grouping
Points are then grouped by the voxel they fall into. LiDAR point clouds are sparse and their density varies greatly across space, so after partitioning, the voxels contain very different numbers of points: in the paper's figure, voxel 1 has a high point density while voxel 3 has a low one.
Random Sampling
A raw LiDAR sweep can easily contain ~100k points; processing them all directly would be extremely expensive, and the unbalanced point density would bias the learning. Therefore, from every voxel containing more than $T$ points, exactly $T$ points are randomly sampled.
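A minimal sketch of this sampling step in plain Python (not the reference implementation; `sample_voxel` is a hypothetical helper, and T = 35 is the paper's value for the car setting):

```python
import random

T = 35  # maximum number of points kept per voxel (car setting)

def sample_voxel(pts, t=T):
    # voxels with at most t points are kept as-is
    if len(pts) <= t:
        return pts
    # otherwise keep a uniformly random subset of exactly t points
    return random.sample(pts, t)

dense_voxel = [(i * 0.1, i * 0.2, i * 0.3, 0.5) for i in range(120)]
sampled = sample_voxel(dense_voxel)
print(len(sampled))  # 35
```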
Stacked Voxel Feature Encoding
This is the paper's main contribution: the VFE (Voxel Feature Encoding) layer, sketched in the paper's Figure 3. Denote the points inside a non-empty voxel as

$V = \{p^i = [x^i, y^i, z^i, r^i]^T \in \mathbb{R}^4\}_{i=1 \ldots t}$,

where $(x^i, y^i, z^i)$ are the 3D coordinates of point $p^i$ and $r^i$ is its reflectance. The centroid $(v^x, v^y, v^z)$ of the points in $V$ is computed, and each point is augmented with its offset from that centroid, so the input feature becomes

$V_{in} = \{\hat{p}^i = [x^i, y^i, z^i, r^i, x^i - v^x, y^i - v^y, z^i - v^z]^T \in \mathbb{R}^7\}_{i=1 \ldots t}$.

Each point feature is then transformed by a fully connected layer, max pooling aggregates the point-wise features into a locally aggregated feature, and the fully connected output is concatenated with the pooled feature, so the result carries both point-wise and locally aggregated information.
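The centroid-offset augmentation above can be sketched as follows (`augment_voxel` is a hypothetical plain-Python helper, not the reference code):

```python
# Each point (x, y, z, r) becomes (x, y, z, r, x - vx, y - vy, z - vz),
# where (vx, vy, vz) is the mean of the points inside the voxel.
def augment_voxel(points):
    n = len(points)
    vx = sum(p[0] for p in points) / n
    vy = sum(p[1] for p in points) / n
    vz = sum(p[2] for p in points) / n
    return [(x, y, z, r, x - vx, y - vy, z - vz) for (x, y, z, r) in points]

voxel = [(1.0, 2.0, 3.0, 0.5), (3.0, 4.0, 5.0, 0.7)]
print(augment_voxel(voxel)[0])  # (1.0, 2.0, 3.0, 0.5, -1.0, -1.0, -1.0)
```

The centroid here is (2, 3, 4), so the first point's offsets are all −1; these 7-dimensional vectors are what the VFE layer's first fully connected layer consumes.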
Sparse Tensor Representation
By processing only the non-empty voxels, we obtain a list of voxel features that can be represented as a sparse 4D tensor. Although a sweep contains on the order of 100k points, roughly 90% of the voxels are empty; keeping only the non-empty ones as a sparse tensor saves a great deal of computation and memory.
- Convolutional middle layers
3D convolutions, each followed by BN and ReLU, extract features from the dense voxel tensor. As the convolutions stack, the receptive field grows and the aggregated context becomes richer.
- Region Proposal Network
The network consists of three blocks of fully convolutional layers. The first layer of each block halves the feature-map resolution with a stride-2 convolution and is followed by several stride-1 convolutions; every convolution includes BN and ReLU. The output of each block is then upsampled to a common size, and the upsampled maps are concatenated into a high-resolution feature map, from which two heads are computed: a probability score map and a regression map.
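As a rough check of how the three deconvolutions (strides 4, 2 and 1 in the RPN code later in this post) bring the block outputs back to a common resolution, assuming the 400 × 352 bird's-eye-view grid of the car setting (`rpn_map_sizes` is a hypothetical helper):

```python
# Each RPN block first downsamples by 2; the three deconvs (stride 4, 2, 1)
# bring the three block outputs back to half the input resolution so they
# can be concatenated channel-wise (3 x 256 = 768 channels).
def rpn_map_sizes(h, w):
    b1 = (h // 2, w // 2)          # after block_1 (stride-2 conv)
    b2 = (h // 4, w // 4)          # after block_2
    b3 = (h // 8, w // 8)          # after block_3
    return [(b3[0] * 4, b3[1] * 4),  # deconv_1, stride 4
            (b2[0] * 2, b2[1] * 2),  # deconv_2, stride 2
            b1]                      # deconv_3, stride 1

print(rpn_map_sizes(400, 352))  # [(200, 176), (200, 176), (200, 176)]
```

All three maps land on the same 200 × 176 grid, which is why they can be concatenated before the score and regression heads.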
- Loss Function
The classification branch is trained with a binary cross-entropy term on positive and negative anchors, weighted by α and β respectively, and the regression branch with a Smooth-L1 loss over the 7 box parameters of the positive anchors (see the VoxelLoss code at the end of this post).
- Efficient Implementation
Each point's 3D coordinates are mapped to the corresponding grid cell through a hash table. Because the point cloud is sparse, roughly 90% of the cells contain no points; these are simply never stored, which saves a large amount of memory. Since the subsequent 3D CNN needs a dense tensor for parallel computation, the missing cells are filled with zeros when the dense tensor is built.
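A minimal sketch of this hash-based grouping, using a plain Python dictionary keyed by integer voxel coordinates (`group_points` is a hypothetical helper, not the reference implementation):

```python
from collections import defaultdict

# Only voxels that actually contain points are stored, keyed by their
# integer grid coordinate; empty voxels are never materialized.
def group_points(points, voxel_size):
    vd, vh, vw = voxel_size
    voxels = defaultdict(list)
    for (x, y, z, r) in points:
        key = (int(z // vd), int(y // vh), int(x // vw))
        voxels[key].append((x, y, z, r))
    return voxels

pts = [(0.1, 0.1, 0.1, 0.3), (0.15, 0.12, 0.11, 0.4), (5.0, 5.0, 0.5, 0.2)]
v = group_points(pts, (0.4, 0.2, 0.2))
print(len(v))  # 2 non-empty voxels for 3 points
```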
Training details
- Car detection: the point cloud is cropped to the range [−3, 1] × [−40, 40] × [0, 70.4] m along (z, y, x), with voxel size (depth, height, width) of 0.4 m, 0.2 m and 0.2 m. A single anchor is used, with length 3.9 m, width 1.6 m and height 1.56 m. Anchors with IoU > 0.6 are positives, those with IoU < 0.45 are negatives, and those in between are ignored.
- Pedestrian and cyclist detection: the point cloud is cropped to [−3, 1] × [−20, 20] × [0, 48] m along (z, y, x), with the same voxel size as for cars. One anchor per class: pedestrians are 0.8 m × 0.6 m × 1.73 m (l × w × h) and cyclists 1.76 m × 0.6 m × 1.73 m. Anchors with IoU > 0.5 are positives, those with IoU < 0.35 are negatives, and those in between are ignored.
- Data augmentation: ① the points inside each ground-truth 3D box are jointly translated by a random offset and rotated by a random angle around the z-axis; ② the GT boxes together with all points in the cloud are rescaled by a random factor drawn from [0.95, 1.05], mainly to make the model robust to objects at different scales.
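The scaling augmentation in ② can be sketched as follows (`random_scale` is a hypothetical plain-Python helper; boxes are assumed to be (cx, cy, cz, l, w, h, yaw) tuples):

```python
import random

def random_scale(points, boxes, low=0.95, high=1.05):
    # one global factor is applied to every point and every box parameter;
    # reflectance and yaw are left unchanged
    s = random.uniform(low, high)
    points = [(x * s, y * s, z * s, r) for (x, y, z, r) in points]
    boxes = [(cx * s, cy * s, cz * s, l * s, w * s, h * s, yaw)
             for (cx, cy, cz, l, w, h, yaw) in boxes]
    return points, boxes

pts, bxs = random_scale([(1.0, 2.0, 0.0, 0.5)],
                        [(1.0, 2.0, 0.0, 3.9, 1.6, 1.56, 0.0)])
```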
Code walkthrough
Reference code: https://github.com/skyhehe123/VoxelNet-pytorch
Basic conv2d + bn + relu, conv3d + bn + relu and fully connected building blocks
# common imports shared by all the snippets below; later snippets also
# reference cfg (voxel count N, grid size D/H/W, T, anchors_per_position)
# from the repo's config module
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# conv2d + bn + relu
class Conv2d(nn.Module):
    def __init__(self, in_channels, out_channels, k, s, p, activation=True, batch_norm=True):
        super(Conv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=k, stride=s, padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm2d(out_channels)
        else:
            self.bn = None
        self.activation = activation

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.activation:
            return F.relu(x, inplace=True)
        else:
            return x
# conv3d + bn + relu
class Conv3d(nn.Module):
    def __init__(self, in_channels, out_channels, k, s, p, batch_norm=True):
        super(Conv3d, self).__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=k, stride=s, padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm3d(out_channels)
        else:
            self.bn = None

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        return F.relu(x, inplace=True)
# Fully Connected Network
class FCN(nn.Module):
    def __init__(self, cin, cout):
        super(FCN, self).__init__()
        self.cout = cout
        self.linear = nn.Linear(cin, cout)
        self.bn = nn.BatchNorm1d(cout)

    def forward(self, x):
        # kk is the number of voxels stacked across the batch
        kk, t, _ = x.shape
        x = self.linear(x.view(kk * t, -1))
        x = F.relu(self.bn(x))
        return x.view(kk, t, -1)
VFE module
# Voxel Feature Encoding layer
class VFE(nn.Module):
    def __init__(self, cin, cout):
        super(VFE, self).__init__()
        assert cout % 2 == 0
        self.units = cout // 2
        self.fcn = FCN(cin, self.units)

    def forward(self, x, mask):
        # point-wise feature
        pwf = self.fcn(x)
        # locally aggregated feature
        laf = torch.max(pwf, 1)[0].unsqueeze(1).repeat(1, cfg.T, 1)
        # point-wise concatenated feature
        pwcf = torch.cat((pwf, laf), dim=2)
        # if a voxel holds fewer than T points, the zero-padded entries
        # are masked out so they do not take part in later computation
        mask = mask.unsqueeze(2).repeat(1, 1, self.units * 2)
        pwcf = pwcf * mask.float()
        return pwcf
SVFE module
# Stacked Voxel Feature Encoding
class SVFE(nn.Module):
    def __init__(self):
        super(SVFE, self).__init__()
        self.vfe_1 = VFE(7, 32)
        self.vfe_2 = VFE(32, 128)
        self.fcn = FCN(128, 128)

    def forward(self, x):
        # padded (all-zero) points are masked out
        mask = torch.ne(torch.max(x, 2)[0], 0)
        x = self.vfe_1(x, mask)
        x = self.vfe_2(x, mask)
        x = self.fcn(x)
        # element-wise max pooling over the points of each voxel
        x = torch.max(x, 1)[0]
        return x
Middle layer module
# Convolutional Middle Layer
class CML(nn.Module):
    def __init__(self):
        super(CML, self).__init__()
        self.conv3d_1 = Conv3d(128, 64, 3, s=(2, 1, 1), p=(1, 1, 1))
        self.conv3d_2 = Conv3d(64, 64, 3, s=(1, 1, 1), p=(0, 1, 1))
        self.conv3d_3 = Conv3d(64, 64, 3, s=(2, 1, 1), p=(1, 1, 1))

    def forward(self, x):
        x = self.conv3d_1(x)
        x = self.conv3d_2(x)
        x = self.conv3d_3(x)
        return x
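Tracing the depth dimension through these three convolutions shows why the full network below can merge depth and channels into 128 for the 2D RPN (a sketch using the standard convolution output-size formula; D = 10 is the car-setting grid depth):

```python
# Output size of a convolution along one axis: (size + 2p - k) // s + 1.
def conv_out(size, k=3, s=1, p=0):
    return (size + 2 * p - k) // s + 1

d = 10                       # car-setting grid depth D'
d = conv_out(d, s=2, p=1)    # conv3d_1: depth stride 2, padding 1 -> 5
d = conv_out(d, s=1, p=0)    # conv3d_2: depth padding 0           -> 3
d = conv_out(d, s=2, p=1)    # conv3d_3: depth stride 2, padding 1 -> 2
print(d, 64 * d)  # 2 128
```

The (64, 2, H, W) output therefore reshapes to (128, H, W), matching the 128 input channels of the RPN.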
RPN module
# Region Proposal Network
class RPN(nn.Module):
    def __init__(self):
        super(RPN, self).__init__()
        self.block_1 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_1 += [Conv2d(128, 128, 3, 1, 1) for _ in range(3)]
        self.block_1 = nn.Sequential(*self.block_1)

        self.block_2 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_2 += [Conv2d(128, 128, 3, 1, 1) for _ in range(5)]
        self.block_2 = nn.Sequential(*self.block_2)

        self.block_3 = [Conv2d(128, 256, 3, 2, 1)]
        # use the Conv2d wrapper here as well so these convs also get BN + ReLU
        self.block_3 += [Conv2d(256, 256, 3, 1, 1) for _ in range(5)]
        self.block_3 = nn.Sequential(*self.block_3)

        self.deconv_1 = nn.Sequential(nn.ConvTranspose2d(256, 256, 4, 4, 0), nn.BatchNorm2d(256))
        self.deconv_2 = nn.Sequential(nn.ConvTranspose2d(128, 256, 2, 2, 0), nn.BatchNorm2d(256))
        self.deconv_3 = nn.Sequential(nn.ConvTranspose2d(128, 256, 1, 1, 0), nn.BatchNorm2d(256))

        self.score_head = Conv2d(768, cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)
        self.reg_head = Conv2d(768, 7 * cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)

    def forward(self, x):
        x = self.block_1(x)
        x_skip_1 = x
        x = self.block_2(x)
        x_skip_2 = x
        x = self.block_3(x)
        x_0 = self.deconv_1(x)
        x_1 = self.deconv_2(x_skip_2)
        x_2 = self.deconv_3(x_skip_1)
        x = torch.cat((x_0, x_1, x_2), 1)
        return self.score_head(x), self.reg_head(x)
Full network
class VoxelNet(nn.Module):
    def __init__(self):
        super(VoxelNet, self).__init__()
        self.svfe = SVFE()
        self.cml = CML()
        self.rpn = RPN()

    def voxel_indexing(self, sparse_features, coords):
        # scatter the non-empty voxel features into a dense 4D grid,
        # leaving empty voxels as zeros
        dim = sparse_features.shape[-1]
        dense_feature = Variable(torch.zeros(dim, cfg.N, cfg.D, cfg.H, cfg.W).cuda())
        dense_feature[:, coords[:, 0], coords[:, 1], coords[:, 2], coords[:, 3]] = sparse_features
        return dense_feature.transpose(0, 1)

    def forward(self, voxel_features, voxel_coords):
        # feature learning network
        vwfs = self.svfe(voxel_features)
        vwfs = self.voxel_indexing(vwfs, voxel_coords)
        # convolutional middle network
        cml_out = self.cml(vwfs)
        # region proposal network
        # merge the depth and feature dims into one, output probability score map and regression map
        psm, rm = self.rpn(cml_out.view(cfg.N, -1, cfg.H, cfg.W))
        return psm, rm
Loss function
class VoxelLoss(nn.Module):
    def __init__(self, alpha, beta):
        super(VoxelLoss, self).__init__()
        # reduction='sum' replaces the deprecated size_average=False
        self.smoothl1loss = nn.SmoothL1Loss(reduction='sum')
        self.alpha = alpha
        self.beta = beta

    def forward(self, rm, psm, pos_equal_one, neg_equal_one, targets):
        # torch.sigmoid replaces the deprecated F.sigmoid
        p_pos = torch.sigmoid(psm.permute(0, 2, 3, 1))
        rm = rm.permute(0, 2, 3, 1).contiguous()
        rm = rm.view(rm.size(0), rm.size(1), rm.size(2), -1, 7)
        targets = targets.view(targets.size(0), targets.size(1), targets.size(2), -1, 7)
        pos_equal_one_for_reg = pos_equal_one.unsqueeze(pos_equal_one.dim()).expand(-1, -1, -1, -1, 7)

        # regression is only supervised on positive anchors
        rm_pos = rm * pos_equal_one_for_reg
        targets_pos = targets * pos_equal_one_for_reg

        cls_pos_loss = -pos_equal_one * torch.log(p_pos + 1e-6)
        cls_pos_loss = cls_pos_loss.sum() / (pos_equal_one.sum() + 1e-6)

        cls_neg_loss = -neg_equal_one * torch.log(1 - p_pos + 1e-6)
        cls_neg_loss = cls_neg_loss.sum() / (neg_equal_one.sum() + 1e-6)

        reg_loss = self.smoothl1loss(rm_pos, targets_pos)
        reg_loss = reg_loss / (pos_equal_one.sum() + 1e-6)
        conf_loss = self.alpha * cls_pos_loss + self.beta * cls_neg_loss
        return conf_loss, reg_loss