[Paper Reading] [3D Object Detection] VoteNet: Deep Hough Voting for 3D Object Detection in Point Clouds

麒麒哈尔

Article link: https://blog.csdn.net/wqwqqwqw1231/article/details/101283243

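VoteNet is an approach to 3D object detection that uses a deep Hough voting strategy to handle the challenge that a 3D object's center can lie far from the point cloud surface. The network uses PointNet++ as its backbone, combined with vote generation, vote aggregation, and object proposal and classification modules. Experiments show that VoteNet outperforms previous methods on the SUN RGB-D and ScanNet datasets, supporting its effectiveness and efficiency.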

Paper: Deep Hough Voting for 3D Object Detection in Point Clouds
ICCV 2019
A paper from Charles R. Qi and Kaiming He, two big names in the field (with co-authors Or Litany and Leonidas J. Guibas)

Hough Voting

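The title of the paper is Deep Hough Voting, so let's start with Hough voting.

Most readers have probably heard of detecting lines with the Hough transform. A line can be described by two parameters (r, θ). For a single point in the image, many lines pass through it, each giving a different (r, θ); in the parameter plane these form a curve. In other words, one image point corresponds to one curve in the parameter plane, and many points produce many curves. If those points lie on a common line, the parameters (r*, θ*) of that line appear on every one of the curves, so the curves all intersect at (r*, θ*). Viewed as voting, each curve casts votes along its length: most locations receive few votes while (r*, θ*) receives many, so (r*, θ*) is taken as the parameters of the line.

So the idea behind the Hough transform is to vote in parameter space; the parameter value with the most votes is the one we are looking for.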
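
The voting idea can be sketched in a few lines of Python. This is only an illustrative toy, not code from the paper: the function name `hough_line_votes`, the 1-degree angle step, and the unit-width r bins are all assumptions made for this example. Each point votes for every (r, θ) bin it is consistent with, using r = x·cos θ + y·sin θ, and the bin with the most votes is read off as the line.

```python
import math
from collections import Counter

def hough_line_votes(points, n_theta=180, r_step=1.0):
    """Accumulate votes in a discretised (r, theta) space for (x, y) points."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            r = x * math.cos(theta) + y * math.sin(theta)
            votes[(round(r / r_step), t)] += 1  # one vote per (r bin, theta bin)
    return votes

# Ten points on the horizontal line y = 5, plus two outliers.
points = [(10.0 * i, 5.0) for i in range(10)] + [(20.0, 90.0), (70.0, 40.0)]
votes = hough_line_votes(points)
(r_bin, t_bin), count = votes.most_common(1)[0]
print(r_bin, t_bin, count)
```

Here the winning bin corresponds to r = 5 and θ = 90°, which is exactly the line y = 5; the two outliers scatter their votes elsewhere and never form a competing peak.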

The paper describes Hough voting as follows:
A traditional Hough voting 2D detector [24] comprises an offline and an online step. First, given a collection of images with annotated object bounding boxes, a codebook is constructed with stored mappings between image patches (or their features) and their offsets to the corresponding object centers. At inference time, interest points are selected from the image to extract patches around them. These patches are then compared against patches in the codebook to retrieve offsets and compute votes. As object patches will tend to vote in agreement, clusters will form near object centers. Finally, the object boundaries are retrieved by tracing cluster votes back to their generating patches.
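My reading of this passage: we start from a set of images (or their features) with annotated boxes. Each image patch (or its feature) inside a box plays the role of a point in line detection, and the patch's offset to the object center plays the role of the parameter being voted on. At inference time, interest points are selected first, the patches around them (or their features) are matched against the codebook to retrieve offsets, and the votes are computed. See reference [24] of the paper for details.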

VoteNet

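The core problem this paper addresses is that, unlike in 2D object detection, the center of a 3D object is often some distance away from any scanned point and lies in empty space: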
We face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point thus hard to regress accurately in one step.

Building on Hough voting, the paper proposes the VoteNet architecture, which improves on traditional Hough voting in the following ways:
“Interest points are described and selected by deep neural networks instead of depending on hand-crafted features.
Vote generation is learned by a network instead of using a codebook. Leveraging larger receptive fields, voting can be made less ambiguous and thus more effective. In addition, a vote location can be augmented with a feature vector allowing for better aggregation.
Vote aggregation is realized through point cloud processing layers with trainable parameters. Utilizing the vote features, the network can potentially filter out low quality votes and generate improved proposals.
Object proposals in the form of: location, dimensions, orientation and even semantic classes can be directly generated from the aggregated features, mitigating the need to trace back votes’ origins.”
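All four points are improvements over traditional Hough voting: neural networks are used to select interest points, generate votes, aggregate votes, and produce boxes.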

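Network Architecture

[Figure: VoteNet network architecture]
The figure above shows the network architecture of VoteNet. Readers familiar with PointRCNN will find it easy to follow.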

Voting in Point Clouds

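This part uses PointNet++ as the backbone, with 4 SA (set abstraction) layers and 2 FP (feature propagation) layers, and produces a tensor of size $M \times (3 + C)$. This tensor holds the selected seeds and the feature vector of each seed. The detailed backbone parameters are listed in the table below, which gives M = 1024 and C = 256.
[Table: backbone layer parameters]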

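Voting is simply the process of using an MLP to regress, for each seed, the offset to its object's center. The regression covers not only the offset to the center of the object the seed belongs to, but also an offset for the feature. The paper describes this part as follows: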
“The voting module MLP has output sizes of 256, 256, 259 for its fully connected layers. The last fully connected layer does not have ReLU or BatchNorm.”
The resulting position offset and feature offset are added element-wise to the backbone output, updating each seed's position and feature.
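
To make the shapes concrete, here is a minimal NumPy sketch of the voting step under stated assumptions. It is not the repository's implementation: the class name `VotingSketch` is hypothetical, the weights are random rather than trained, and BatchNorm is left out entirely. It only shows a shared MLP with output sizes 256, 256, 259 whose last layer has no ReLU, and how the 259 outputs split into a 3-D position offset and a 256-D feature offset that are added to the seeds.

```python
import numpy as np

rng = np.random.default_rng(0)

class VotingSketch:
    """Shared per-seed MLP with output sizes 256, 256, 3 + C (259 for C = 256)."""

    def __init__(self, feat_dim=256):
        sizes = [feat_dim, 256, 256, 3 + feat_dim]
        # Random weights stand in for trained parameters (illustration only).
        self.weights = [rng.normal(0.0, 0.01, (a, b))
                        for a, b in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def __call__(self, seed_xyz, seed_features):
        # seed_xyz: (M, 3), seed_features: (M, C)
        h = seed_features
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < last:               # the last layer has no ReLU
                h = np.maximum(h, 0.0)
        vote_xyz = seed_xyz + h[:, :3]            # position offset
        vote_features = seed_features + h[:, 3:]  # feature offset
        return vote_xyz, vote_features

seed_xyz = rng.uniform(-1.0, 1.0, (1024, 3))
seed_features = rng.normal(size=(1024, 256))
vote_xyz, vote_features = VotingSketch()(seed_xyz, seed_features)
print(vote_xyz.shape, vote_features.shape)
```

With M = 1024 seeds and C = 256, the votes have the same shapes as the seeds, which is what lets the later modules treat votes as just another point set with features.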

Object Proposal and Classification from Votes

Sampling and Grouping
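Farthest point sampling is applied to the votes generated from the seeds, and the votes within a certain distance of each sampled point are then grouped together. The experiments validate this approach, with a distance (radius) of 0.2 working best.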
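
The sampling and grouping step can be sketched as follows. This is a simplified NumPy illustration, not the CUDA kernels used in practice: the function names are hypothetical, the first sample is arbitrarily fixed to index 0, and groups here have variable size, whereas an SA layer typically pads or truncates each group to a fixed number of samples.

```python
import numpy as np

def farthest_point_sampling(xyz, k):
    """Greedily pick k indices so that each new point is far from those chosen."""
    n = xyz.shape[0]
    chosen = np.zeros(k, dtype=int)   # chosen[0] = 0: arbitrary starting point
    min_dist = np.full(n, np.inf)     # distance to the nearest chosen point
    for i in range(1, k):
        d = np.linalg.norm(xyz - xyz[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def ball_group(xyz, centers, radius):
    """For each center, return the indices of all points within `radius`."""
    return [np.flatnonzero(np.linalg.norm(xyz - c, axis=1) <= radius)
            for c in centers]

rng = np.random.default_rng(0)
votes = rng.uniform(0.0, 1.0, (200, 3))   # stand-in for vote positions
idx = farthest_point_sampling(votes, 8)
groups = ball_group(votes, votes[idx], radius=0.2)
print([len(g) for g in groups])
```

Each group always contains its own center, and the sampled centers are distinct as long as the input points are.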

Proposal and Classify
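The grouped features are fed into a PointNet: one MLP, followed by max pooling to obtain a feature vector, followed by a second MLP that produces the output. Confusingly, the main text says the first MLP takes a PointNet-like form, while the appendix describes it as an SA structure: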
“The proposal module as mentioned in the main paper is a SA layer followed by another MLP after the max-pooling in each local region. ”
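As I see it, what distinguishes PointNet is the T-Net placed before the MLP, whereas an SA module applies the MLP directly to relative positions. I find this part confusing, and the paper does not explain it clearly.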
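
One way to read the two descriptions together is sketched below: a shared MLP applied to each vote in a cluster (using positions relative to the cluster center, as an SA module would), a max pool over the cluster, and then a second MLP. This is my interpretation under assumptions, not the paper's code: the layer widths of 128, the random weights, the function names, and the single-layer output MLP are all made up for illustration, and 79 is simply 5 + 2·12 + 4·10 + 10.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(x, weights):
    """Apply the same ReLU layers to every row (every vote) of x."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def propose_from_cluster(vote_xyz, vote_feat, center, mlp1, w_out):
    local_xyz = vote_xyz - center                        # relative positions
    x = np.concatenate([local_xyz, vote_feat], axis=1)   # (n, 3 + C)
    per_vote = shared_mlp(x, mlp1)                       # (n, 128)
    pooled = per_vote.max(axis=0)                        # max pool over votes
    return pooled @ w_out                                # (out_channels,)

C, out_channels = 256, 79
mlp1 = [rng.normal(0.0, 0.05, (3 + C, 128)), rng.normal(0.0, 0.05, (128, 128))]
w_out = rng.normal(0.0, 0.05, (128, out_channels))

cluster_xyz = rng.uniform(0.0, 0.2, (16, 3))
cluster_feat = rng.normal(size=(16, C))
center = cluster_xyz.mean(axis=0)
out = propose_from_cluster(cluster_xyz, cluster_feat, center, mlp1, w_out)

# Max pooling makes the result independent of the order of the votes.
perm = rng.permutation(16)
out_perm = propose_from_cluster(cluster_xyz[perm], cluster_feat[perm], center, mlp1, w_out)
print(out.shape)
```

In this reading there is no T-Net anywhere, so "PointNet-like" would just mean "shared MLP plus symmetric pooling", which is consistent with the appendix calling it an SA layer.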

The output of the second MLP has $5 + 2N_H + 4N_S + N_C$ channels, following the output design of Frustum PointNets. The paper explains:
“The layer’s output has 5+2NH+4NS+NC channels where NH is the number of heading bins (we predict a classification score for each heading bin and a regression offset for each bin–relative to the bin center and normalized by the bin size), NS is the number of size templates (we predict a classification score for each size template and 3 scale regression offsets for height, width and length) and NC is the number of semantic classes. In SUN RGB-D: NH = 12, NS = NC = 10, in ScanNet: NH = 12, NS = NC = 18. In the first 5 channels, the first two are for objectness classification and the rest three are for center regression (relative to the vote cluster center).”
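
As a quick sanity check on the channel count, the formula can be evaluated for the two settings quoted above. The helper name `proposal_channels` is just for this example.

```python
def proposal_channels(num_heading_bins, num_size_templates, num_classes):
    """Channel count 5 + 2*NH + 4*NS + NC of the proposal output layer."""
    objectness = 2                     # objectness classification
    center = 3                         # center regression
    heading = 2 * num_heading_bins     # one score + one offset per heading bin
    size = 4 * num_size_templates      # one score + 3 offsets per size template
    return objectness + center + heading + size + num_classes

sun_rgbd = proposal_channels(12, 10, 10)
scannet = proposal_channels(12, 18, 18)
print(sun_rgbd, scannet)  # 79 119
```

So each proposal is described by a 79-dimensional vector on SUN RGB-D and a 119-dimensional vector on ScanNet.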

Experiments

Implementation details for preparing the input:
“The floor height is estimated as the 1% percentile of all points’ heights. To augment the training data, we randomly sub-sample the points from the scene points on-the-fly.”
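
These two preprocessing steps are easy to sketch with NumPy. The function names are hypothetical, and the example assumes that z is the up axis and that sampling falls back to sampling with replacement when the scene has fewer points than requested; the quote above does not specify either.

```python
import numpy as np

def estimate_floor_height(points):
    """Floor height as the 1st percentile of point heights (z assumed up)."""
    return float(np.percentile(points[:, 2], 1))

def random_subsample(points, num_points, rng):
    """Randomly pick num_points points; with replacement only if too few."""
    replace = points.shape[0] < num_points
    choice = rng.choice(points.shape[0], num_points, replace=replace)
    return points[choice]

rng = np.random.default_rng(0)
scene = rng.uniform([0.0, 0.0, 0.0], [5.0, 5.0, 2.5], (50000, 3))
floor_z = estimate_floor_height(scene)
sampled = random_subsample(scene, 20000, rng)
print(round(floor_z, 3), sampled.shape)
```

Using a low percentile instead of the minimum makes the floor estimate robust to a few stray points below the floor; because the subsample is redrawn on the fly, the network sees a slightly different point set for the same scene at every epoch.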

The experiments demonstrate the effectiveness of the method:
“VoteNet outperforms all previous methods by at least 3.7 and 18.4 mAP increase in SUN RGB-D and ScanNet respectively. Notably,”

Ablation Study
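The experiments validate the effectiveness of voting: compared with BoxNet (which can be thought of as the RPN in PointRCNN), the results improve markedly. Moreover, across categories, the farther the scanned points are from the object center, the larger the gain from voting.

Comparative experiments demonstrate the effectiveness of vote aggregation.

The experiments also show that the network has few parameters and runs fast.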

Appendix
An improvement I find interesting:
“We report two ways of using the proposals: joint and per-class. For the joint proposal we propose K objects’ bounding boxes for all the 10 categories, where we consider each proposal as the semantic class it has the largest confidence in, and use their objectness scores to rank them. For the per-class proposal we duplicate the K proposal 10 times thus have K proposals per class where we use the multiplication of semantic probability for that class and the objectness probability to rank them. The latter way of using proposals gives us a slight improvement on AP and a big boost on AR.”
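
The difference between the two ranking schemes is easiest to see on a tiny example. The sketch below follows the quoted description with K = 3 proposals and 2 classes; the function names and the toy probabilities are made up for illustration.

```python
import numpy as np

def rank_joint(objectness, sem_probs):
    """Joint: each proposal takes its most likely class, ranked by objectness."""
    cls = sem_probs.argmax(axis=1)
    order = np.argsort(-objectness)
    return [(int(i), int(cls[i]), float(objectness[i])) for i in order]

def rank_per_class(objectness, sem_probs):
    """Per-class: all K proposals per class, ranked by objectness * class prob."""
    scores = objectness[:, None] * sem_probs   # (K, num_classes)
    ranked = {}
    for c in range(sem_probs.shape[1]):
        order = np.argsort(-scores[:, c])
        ranked[c] = [(int(i), float(scores[i, c])) for i in order]
    return ranked

objectness = np.array([0.9, 0.6, 0.8])
sem_probs = np.array([[0.7, 0.3],
                      [0.2, 0.8],
                      [0.6, 0.4]])

joint = rank_joint(objectness, sem_probs)
per_class = rank_per_class(objectness, sem_probs)
print([i for i, _, _ in joint])          # proposal order in the joint scheme
print([i for i, _ in per_class[1]])      # proposal order for class 1
```

In the joint scheme only proposal 1 is ever considered for class 1, while the per-class scheme also keeps proposals 2 and 0 as lower-ranked candidates for that class. That extra set of candidates is a plausible reason for the large gain in recall reported in the quote.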

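Thoughts

Is this model one-stage or two-stage?
If the distinction between one-stage and two-stage rests on whether proposals are refined, then this model predicts boxes only once, at the very end. At the backbone output it predicts an offset for each seed, which can be read either as an offset of the seed or as a prediction of the proposal center. The model has no explicit proposal refinement, but it does run a further stage after the backbone output.

Comparison with PointRCNN
PointRCNN can be regarded as a conventional two-stage model. By comparison, the entire network in this paper could serve as the RPN of PointRCNN, with PointRCNN's second stage then used for refinement.
But if this network is itself regarded as a two-stage model, the seed offset computation can be seen as generating proposal centers, and the subsequent sampling and grouping as an RoI crop operation.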

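Code Walkthrough

Code: https://github.com/facebookresearch/votenet
The code is very cleanly written. It uses an end_points dictionary to store intermediate variables, which makes it easy to inspect the important ones.
The network is split into the modules backbone_net, vgen, and pnet, as follows:
[Figure: network modules]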

    # Forward pass of VoteNet
    def forward(self, inputs):
        """ Forward pass of the network
        Args:
            inputs: dict
                {point_clouds}

                point_clouds: Variable(torch.cuda.FloatTensor)
                    (B, N, 3 + input_channels) tensor
                    Point cloud to run predicts on
                    Each point in the point-cloud MUST
                    be formatted as (x, y, z, features...)
        Returns:
            end_points: dict
        """
        end_points = {}  # stores intermediate variables of the forward pass
        batch_size = inputs['point_clouds'].shape[0]

        end_points = self.backbone_net(inputs['point_clouds'], end_points)

        # --------- HOUGH VOTING ---------
        xyz = end_points['fp2_xyz']
        features = end_points['fp2_features']
        end_points['seed_inds'] = end_points['fp2_inds']
        end_points['seed_xyz'] = xyz  # Seeds' xyz
        end_points['seed_features'] = features  # Seeds' features

        xyz, features = self.vgen(xyz, features)  # Vote
        # L2-normalize the vote features along the channel dimension
        features_norm = torch.norm(features, p=2, dim=1)
        features = features.div(features_norm.unsqueeze(1))
        end_points['vote_xyz'] = xyz  # Votes' xyz
        end_points['vote_features'] = features  # Votes' features

        end_points = self.pnet(xyz, features, end_points)  # Sampling & Grouping, Propose & Classify

        return end_points
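
The two lines after `self.vgen` L2-normalize each vote's feature vector. The NumPy sketch below mirrors that step so the shapes are easy to follow; it assumes the features are laid out as (B, C, num_votes), which is what normalizing over `dim=1` suggests, and uses random data in place of real vote features.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 256, 1024))        # assumed layout: (B, C, num_votes)

# Same operation as torch.norm(features, p=2, dim=1) followed by div(...unsqueeze(1)).
features_norm = np.linalg.norm(features, ord=2, axis=1)   # (B, num_votes)
normalized = features / features_norm[:, None, :]         # broadcast over channels

print(features_norm.shape, normalized.shape)
```

After this step every vote's feature vector has unit length, so only its direction carries information into the proposal module.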

backbone_net

    
    def __init__(self, input_feature_dim=0):
        """
        backbone_net consists of 4 SA layers and 2 FP layers
        """

        super().__init__()

        self.sa1 = PointnetSAModuleVotes(
                npoint=2048,  # sample 2048 points
                radius=0.2,  # each ball region has a radius of 0.2 m
                nsample=64,  # sample 64 points within each ball region
                mlp=[input_feature_dim, 64, 64, 128],  # channel sizes of the MLP