MVX-Net 3D Algorithm Notes

__YOLO__0

Last modified: 2023-12-20 15:51:18

Tags: 3d, algorithm notes

First published: 2023-12-14 20:16:24

Article link: https://blog.csdn.net/weixin_41922853/article/details/132230992

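These are notes I took while studying, kept to organize my thinking and for later reference. Corrections are welcome if you spot any mistakes.
References:
paper:
code: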

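Abstract:

The paper uses PointFusion and VoxelFusion to achieve early fusion of camera images and point clouds. On the KITTI dataset it ranks in the top 2 in five of the bird's-eye-view and 3D detection categories.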

I. INTRODUCTION

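There are currently two common approaches to 3D detection: (1) Convert the 3D point cloud into hand-crafted representations such as a BEV map, then detect and classify with a 2D CNN. This approach suffers from quantization: when few points fall on an object, performance drops sharply. (2) Process the 3D point cloud directly with a 3D CNN, which needs too much memory and runs into a computational bottleneck.
VoxelNet greatly improved the efficiency of point cloud processing.
This paper extends VoxelNet to multiple modalities, fusing point cloud features with image semantic features at an early stage. Two fusion methods are proposed:
(1) PointFusion: a 2D image feature extractor extracts image features; the raw points are projected onto the image, and the image features at the projected locations are gathered. After a dimension transform they are fused with the point features by direct addition, and the result is fed into VoxelNet.
(2) VoxelFusion: VoxelNet produces 3D voxels, which are projected onto the image; a pretrained CNN then extracts features for each projected voxel. Compared with PointFusion, VoxelFusion is a relatively later fusion technique.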
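
The projection-and-lookup step of PointFusion can be sketched as follows. This is a minimal illustration, assuming a pinhole camera with a hypothetical 3x4 projection matrix `P` and a toy feature map; the helper names `project_point` and `sample_feature` and all numeric values are made up for the example and do not come from the paper or from mmdetection3d.

```python
def project_point(P, xyz):
    """Project a 3D point with a 3x4 matrix P; returns pixel (u, v)."""
    hom = [xyz[0], xyz[1], xyz[2], 1.0]
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    return u / w, v / w


def sample_feature(feat_map, u, v):
    """Nearest-neighbour lookup in an H x W x C feature map (clamped)."""
    h, w = len(feat_map), len(feat_map[0])
    col = min(max(int(round(u)), 0), w - 1)
    row = min(max(int(round(v)), 0), h - 1)
    return feat_map[row][col]


# Toy camera: focal length 2, principal point (1, 1). Hypothetical values.
P = [[2.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

# 3 x 3 feature map whose 2-channel feature encodes (row, col).
feat_map = [[[float(r), float(c)] for c in range(3)] for r in range(3)]

u, v = project_point(P, (0.5, 0.0, 1.0))
img_feat = sample_feature(feat_map, u, v)  # image feature for this point
```

The gathered `img_feat` is what gets fused with the feature of the same point. A real implementation would typically interpolate rather than round to the nearest pixel.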

II. RELATED WORK

III. PROPOSED METHOD

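Only one of PointFusion or VoxelFusion is used at a time.
PointFusion: the raw points are projected onto the image, and the features of the pretrained 2D feature extractor are read out at the projected locations.
VoxelFusion: the non-empty voxels of the voxel grid are projected onto the image, and 2D features are then extracted for each projected region.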

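A. 2D Detection Network
Features are extracted with the Faster R-CNN framework and a VGG16 backbone.
B. VoxelNet
Consists of VFE layers, convolutional middle layers, and a 3D RPN.
VFE encodes the raw points at the level of individual voxels; a VFE layer is built on fully connected layers. See the notes on point cloud processing for details.
C. Multimodal Fusion
PointFusion: see the code analysis below.
VoxelFusion: non-empty voxels are projected onto the image to produce 2D ROIs, followed by ROI pooling. Compared with PointFusion, it needs less memory and runs faster. It is also easy to extend by projecting all voxels, which uses more of the image features and helps avoid missing objects that the point cloud does not cover. (Unfortunately there is no code implementation of this method yet, possibly because its metrics in the paper are lower.)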
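
How a voxel turns into a 2D ROI can be sketched by projecting its eight corners and taking the enclosing box. This is only an illustration under assumed conventions (axis-aligned voxel in camera coordinates, hypothetical 3x4 projection matrix `P`); `voxel_to_roi` is a made-up helper, not code from the paper.

```python
from itertools import product


def project(P, xyz):
    """Project a 3D point with a 3x4 matrix P; returns pixel (u, v)."""
    hom = [xyz[0], xyz[1], xyz[2], 1.0]
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    return u / w, v / w


def voxel_to_roi(P, center, size):
    """Project the 8 corners of an axis-aligned voxel and return the
    enclosing 2D box (u_min, v_min, u_max, v_max)."""
    half = [s / 2.0 for s in size]
    pixels = []
    for signs in product((-1.0, 1.0), repeat=3):
        corner = [c + sg * h for c, sg, h in zip(center, signs, half)]
        pixels.append(project(P, corner))
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical camera: focal length 100, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

roi = voxel_to_roi(P, center=(0.0, 0.0, 2.5), size=(1.0, 1.0, 1.0))
```

The resulting box would then be passed to ROI pooling on the image feature map to obtain one feature vector per voxel.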

D. Training Details
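In VoxelFusion, projecting all voxels onto the image handles distant objects better.
The experiments also show that fusing the raw image directly performs worse than fusing features extracted by a CNN.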

Code analysis: based on the mmdetection3d framework, PointFusion method.
The overall structure of the model code is as follows:

def forward(self,
                inputs: Union[dict, List[dict]],
                data_samples: OptSampleList = None,
                mode: str = 'tensor',
                **kwargs) -> ForwardResults:
        """The unified entry for a forward process in both training and test.

        The method should accept three modes: "tensor", "predict" and "loss":

        - "tensor": Forward the whole network and return tensor or tuple of
        tensor without any post-processing, same as a common nn.Module.
        - "predict": Forward and return the predictions, which are fully
        processed to a list of :obj:`Det3DDataSample`.
        - "loss": Forward and return a dict of losses according to the given
        inputs and data samples.

        Note that this method handles neither back propagation nor
        optimizer updating, which are done in the :meth:`train_step`.

        Args:
            inputs  (dict | list[dict]): When it is a list[dict], the
                outer list indicate the test time augmentation. Each
                dict contains batch inputs
                which include 'points' and 'imgs' keys.

                - points (list[torch.Tensor]): Point cloud of each sample.
                - imgs (torch.Tensor): Image tensor has shape (B, C, H, W).
            data_samples (list[:obj:`Det3DDataSample`],
                list[list[:obj:`Det3DDataSample`]], optional): The
                annotation data of every samples. When it is a list[list], the
                outer list indicate the test time augmentation, and the
                inter list indicate the batch. Otherwise, the list simply
                indicate the batch. Defaults to None.
            mode (str): Return what kind of value. Defaults to 'tensor'.

        Returns:
            The return type depends on ``mode``.

            - If ``mode="tensor"``, return a tensor or a tuple of tensor.
            - If ``mode="predict"``, return a list of :obj:`Det3DDataSample`.
            - If ``mode="loss"``, return a dict of tensor.
        """
        if mode == 'loss':
            return self.loss(inputs, data_samples, **kwargs)
        elif mode == 'predict':
            if isinstance(data_samples[0], list):
                # aug test
                assert len(data_samples[0]) == 1, 'Only support ' \
                                                  'batch_size 1 ' \
                                                  'in mmdet3d when ' \
                                                  'do the test' \
                                                  'time augmentation.'
                return self.aug_test(inputs, data_samples, **kwargs)
            else:
                return self.predict(inputs, data_samples, **kwargs)
        elif mode == 'tensor':
            return self._forward(inputs, data_samples, **kwargs)
        else:
            raise RuntimeError(f'Invalid mode "{mode}". '
                               'Only supports loss, predict and tensor mode')
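
The mode dispatch above can be reduced to a small standalone sketch. `ToyDetector` is a hypothetical stand-in whose methods return placeholder values; it only mirrors the control flow and is not the mmdetection3d class.

```python
class ToyDetector:
    """Hypothetical stand-in that mimics the three-mode dispatch."""

    def loss(self, inputs, data_samples):
        return {'loss': 0.0}

    def predict(self, inputs, data_samples):
        return ['prediction' for _ in data_samples]

    def _forward(self, inputs, data_samples=None):
        return inputs

    def forward(self, inputs, data_samples=None, mode='tensor'):
        handlers = {
            'loss': self.loss,
            'predict': self.predict,
            'tensor': self._forward,
        }
        if mode not in handlers:
            raise RuntimeError(f'Invalid mode "{mode}".')
        return handlers[mode](inputs, data_samples)


model = ToyDetector()
preds = model.forward([1, 2], ['a', 'b'], mode='predict')
raw = model.forward([1, 2])  # default mode is 'tensor'
```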

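The entry point covers training (loss) and inference (predict). In both modes the first step is feature extraction, which mainly consists of image feature extraction and point cloud feature extraction. Taking inference as an example: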

def predict(self, batch_inputs_dict: Dict[str, Optional[Tensor]],
                batch_data_samples: List[Det3DDataSample],
                **kwargs) -> List[Det3DDataSample]:
        """Forward of testing.

        Args:
            batch_inputs_dict (dict): The model input dict which include
                'points' keys.

                - points (list[torch.Tensor]): Point cloud of each sample.
            batch_data_samples (List[:obj:`Det3DDataSample`]): The Data
                Samples. It usually includes information such as
                `gt_instance_3d`.

        Returns:
            list[:obj:`Det3DDataSample`]: Detection results of the
            input sample. Each Det3DDataSample usually contain
            'pred_instances_3d'. And the ``pred_instances_3d`` usually
            contains following keys.

            - scores_3d (Tensor): Classification scores, has a shape
                (num_instances, )
            - labels_3d (Tensor): Labels of bboxes, has a shape
                (num_instances, ).
            - bbox_3d (:obj:`BaseInstance3DBoxes`): Prediction of bboxes,
                contains a tensor with shape (num_instances, 7).
        """
        batch_input_metas = [item.metainfo for item in batch_data_samples]
        img_feats, pts_feats = self.extract_feat(batch_inputs_dict,
                                                 batch_input_metas)
        if pts_feats and self.with_pts_bbox:
            results_list_3d = self.pts_bbox_head.predict(
                pts_feats, batch_data_samples, **kwargs)
        else:
            results_list_3d = None

        if img_feats and self.with_img_bbox:
            # TODO check this for camera modality
            results_list_2d = self.predict_imgs(img_feats, batch_data_samples,
                                                **kwargs)
        else:
            results_list_2d = None

        detsamples = self.add_pred_to_datasample(batch_data_samples,
                                                 results_list_3d,
                                                 results_list_2d)
        return detsamples

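It calls img_feats, pts_feats = self.extract_feat(batch_inputs_dict, batch_input_metas); the two parts are described in turn below.
First, the image feature extraction module follows the Faster R-CNN structure: ResNet-50 extracts the features and FPN serves as the neck. Call: img_feats = self.extract_img_feat(imgs, batch_input_metas)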

def extract_img_feat(self, img: Tensor, input_metas: List[dict]) -> dict:
        """Extract features of images."""
        if self.with_img_backbone and img is not None:
            input_shape = img.shape[-2:]
            # update real input shape of each single img
            for img_meta in input_metas:
                img_meta.update(input_shape=input_shape)

            if img.dim() == 5 and img.size(0) == 1:
                img.squeeze_()
            elif img.dim() == 5 and img.size(0) > 1:
                B, N, C, H, W = img.size()
                img = img.view(B * N, C, H, W)
            img_feats = self.img_backbone(img)  # backbone: ResNet-50
        else:
            return None
        if self.with_img_neck:
            img_feats = self.img_neck(img_feats)  # neck: FPN
        # Observed output shapes:
        #   img_feats[0].shape: ([1, 256, 176, 232])
        #   img_feats[1].shape: ([1, 256, 88, 116])
        #   img_feats[2].shape: ([1, 256, 44, 58])
        #   img_feats[3].shape: ([1, 256, 22, 29])
        #   img_feats[4].shape: ([1, 256, 11, 15])
        return img_feats

Point cloud feature extraction and image/point cloud fusion module. Call: pts_feats = self.extract_pts_feat(voxel_dict, points=points, img_feats=img_feats, batch_input_metas=batch_input_metas)

def extract_pts_feat(
            self,
            voxel_dict: Dict[str, Tensor],
            points: Optional[List[Tensor]] = None,
            img_feats: Optional[Sequence[Tensor]] = None,
            batch_input_metas: Optional[List[dict]] = None
    ) -> Sequence[Tensor]:
        """Extract features of points.

        Args:
            voxel_dict(Dict[str, Tensor]): Dict of voxelization infos.
            points (List[tensor], optional):  Point cloud of multiple inputs.
            img_feats (list[Tensor], tuple[tensor], optional): Features from
                image backbone.
            batch_input_metas (list[dict], optional): The meta information
                of multiple samples. Defaults to None.

        Returns:
            Sequence[tensor]: points features of multiple inputs
            from backbone or neck.
        """
        if not self.with_pts_bbox:
            return None
        voxel_features, feature_coors = self.pts_voxel_encoder(
            voxel_dict['voxels'], voxel_dict['coors'], points, img_feats,
            batch_input_metas)  # see class DynamicVFE: point feature encoding and fusion
        # voxel_features: torch.Size([11986, 128]), feature_coors: torch.Size([11986, 4])
        batch_size = voxel_dict['coors'][-1, 0] + 1
        x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size) # torch.Size([1, 256, 200, 150])
        x = self.pts_backbone(x) # 2 x 5 2D conv layers; outputs two feature maps: torch.Size([1, 128, 200, 150]) and torch.Size([1, 256, 100, 75])
        if self.with_pts_neck:
            x = self.pts_neck(x) # transposed convs align both maps to torch.Size([1, 256, 200, 150]), then concat -> torch.Size([1, 512, 200, 150])
        return x
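
The line `batch_size = voxel_dict['coors'][-1, 0] + 1` reads the batch index from the last coordinate row, so it relies on the rows being ordered by batch index. A small sketch with plain lists (the helper names are hypothetical, and the row ordering is an assumption about how the voxel coordinates are assembled):

```python
def infer_batch_size(coors):
    """coors rows are (batch_idx, z_idx, y_idx, x_idx).
    Mirrors `coors[-1, 0] + 1`; valid only if the last row holds the
    largest batch index."""
    return coors[-1][0] + 1


def infer_batch_size_safe(coors):
    """Order-independent variant (an alternative, not the source's code)."""
    return max(row[0] for row in coors) + 1


coors = [(0, 0, 5, 7), (0, 1, 2, 3), (1, 0, 4, 4), (1, 0, 9, 1)]
shuffled = [coors[2], coors[0], coors[3], coors[1]]
```

With the ordered `coors` both helpers agree; with `shuffled` only the `max`-based variant still returns the right batch size.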

Point cloud feature encoding and fusion module. Call: self.pts_voxel_encoder(voxel_dict['voxels'], voxel_dict['coors'], points, img_feats, batch_input_metas)

See class DynamicVFE:
def forward(self,
                features: Tensor,
                coors: Tensor,
                points: Optional[Sequence[Tensor]] = None,
                img_feats: Optional[Sequence[Tensor]] = None,
                img_metas: Optional[dict] = None,
                *args,
                **kwargs) -> tuple:
        """Forward functions.
        self.pts_voxel_encoder(
            voxel_dict['voxels'], voxel_dict['coors'], points, img_feats,
            batch_input_metas)
        Args:
            features (torch.Tensor): Features of voxels, shape is NxC.
            coors (torch.Tensor): Coordinates of voxels, shape is  Nx(1+NDim).
            points (list[torch.Tensor], optional): Raw points used to guide the
                multi-modality fusion. Defaults to None.
            img_feats (list[torch.Tensor], optional): Image features used for
                multi-modality fusion. Defaults to None.
            img_metas (dict, optional): [description]. Defaults to None.

        Returns:
            tuple: If `return_point_feats` is False, returns voxel features and
                its coordinates. If `return_point_feats` is True, returns
                feature of each points inside voxels.
        """
        features_ls = [features] # features is just points
        # Find distance of x, y, and z from cluster center
        if self._with_cluster_center: # True
            voxel_mean, mean_coors = self.cluster_scatter(features, coors)#torch.Size([11986, 4]) 
            points_mean = self.map_voxel_center_to_point(
                coors, voxel_mean, mean_coors) 
            # TODO: maybe also do cluster for reflectivity
            f_cluster = features[:, :3] - points_mean[:, :3]
            features_ls.append(f_cluster) # add the offsets from the cluster (voxel mean) center

        # Find distance of x, y, and z from pillar center
        if self._with_voxel_center:
            f_center = features.new_zeros(size=(features.size(0), 3))
            f_center[:, 0] = features[:, 0] - (
                coors[:, 3].type_as(features) * self.vx + self.x_offset)
            f_center[:, 1] = features[:, 1] - (
                coors[:, 2].type_as(features) * self.vy + self.y_offset)
            f_center[:, 2] = features[:, 2] - (
                coors[:, 1].type_as(features) * self.vz + self.z_offset)
            features_ls.append(f_center) # add the offsets from the pillar (voxel) center

        if self._with_distance:
            points_dist = torch.norm(features[:, :3], 2, 1, keepdim=True)
            features_ls.append(points_dist)

        # Combine together feature decorations
        features = torch.cat(features_ls, dim=-1) # torch.Size([23878, 10])
        for i, vfe in enumerate(self.vfe_layers):
            point_feats = vfe(features) # fully connected + ReLU; shape entering the fusion layer: torch.Size([23878, 64])
            if (i == len(self.vfe_layers) - 1 and self.fusion_layer is not None
                    and img_feats is not None):
                point_feats = self.fusion_layer(img_feats, points, point_feats,
                                                img_metas) # fusion, torch.Size([23878, 128])
            voxel_feats, voxel_coors = self.vfe_scatter(point_feats, coors) # aggregate point features into voxels
            if i != len(self.vfe_layers) - 1:
                # need to concat voxel feats if it is not the last vfe
                feat_per_point = self.map_voxel_center_to_point(
                    coors, voxel_feats, voxel_coors)
                features = torch.cat([point_feats, feat_per_point], dim=1)

        if self.return_point_feats:
            return point_feats
        return voxel_feats, voxel_coors
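
The cluster-center decoration (`cluster_scatter` followed by `map_voxel_center_to_point`) amounts to a per-voxel mean that is mapped back to every point. A minimal pure-Python sketch of that idea, with a hypothetical helper `decorate_points` and string voxel keys standing in for the coordinate rows:

```python
from collections import defaultdict


def decorate_points(points, voxel_ids):
    """points: list of (x, y, z, intensity); voxel_ids: voxel key per point.
    Returns per-point rows: the 4 original values plus the xyz offset
    from the mean of the points in the same voxel."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for p, vid in zip(points, voxel_ids):      # "scatter mean" step
        for k in range(3):
            sums[vid][k] += p[k]
        counts[vid] += 1
    means = {vid: [s / counts[vid] for s in sums[vid]] for vid in sums}

    decorated = []
    for p, vid in zip(points, voxel_ids):      # map voxel mean back to points
        offset = [p[k] - means[vid][k] for k in range(3)]
        decorated.append(list(p) + offset)
    return decorated


points = [(0.0, 0.0, 0.0, 0.5), (2.0, 2.0, 2.0, 0.1), (10.0, 0.0, 0.0, 0.9)]
rows = decorate_points(points, ['a', 'a', 'b'])
```

The voxel-center offsets follow the same pattern with the geometric voxel center in place of the mean. Original features (4) plus the two offset groups (3 + 3) give the 10 channels noted in the code comments.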

Fusion layer. Call: point_feats = self.fusion_layer(img_feats, points, point_feats, img_metas) # fusion happens at the last VFE layer

See class PointFusion:
def forward(self, img_feats: List[Tensor], pts: List[Tensor],
                pts_feats: Tensor, img_metas: List[dict]) -> Tensor:
        """Forward function.

        Args:
            img_feats (List[Tensor]): Image features.
            pts: (List[Tensor]): A batch of points with shape N x 3.
            pts_feats (Tensor): A tensor consist of point features of the
                total batch.
            img_metas (List[dict]): Meta information of images.

        Returns:
            Tensor: Fused features of each point.
        """
        # pts_feats.shape = torch.Size([23878, 64])
        # Use each point's projected image coordinates to sample, from every
        # feature level, the features matching the N points. This step works at point level.
        img_pts = self.obtain_mlvl_feats(img_feats, pts, img_metas) # torch.Size([23878, 640])
        img_pre_fuse = self.img_transform(img_pts) # fully connected + BN, torch.Size([23878, 128])
        if self.training and self.dropout_ratio > 0:
            img_pre_fuse = F.dropout(img_pre_fuse, self.dropout_ratio)
        pts_pre_fuse = self.pts_transform(pts_feats) # fully connected + BN, torch.Size([23878, 128])

        fuse_out = img_pre_fuse + pts_pre_fuse # fuse by element-wise addition of the two features
        if self.activate_out:
            fuse_out = F.relu(fuse_out)
        if self.fuse_out: # False
            fuse_out = self.fuse_conv(fuse_out)

        return fuse_out #torch.Size([23878, 128])
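
The core of this forward pass is: bring both modalities to the same width with a linear layer, add, then apply ReLU. A simplified sketch for a single point follows; batch normalization and dropout are left out, and the weights are arbitrary illustrative values, not learned parameters.

```python
def linear(x, weight, bias):
    """y = W x + b for one feature vector (weight is out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def fuse(img_feat, pts_feat, img_w, img_b, pts_w, pts_b):
    """Project both features to a common width, add, and activate."""
    img_pre = linear(img_feat, img_w, img_b)
    pts_pre = linear(pts_feat, pts_w, pts_b)
    return relu([a + b for a, b in zip(img_pre, pts_pre)])


img_feat = [1.0, 2.0, 3.0]          # 3-dim image feature for one point
pts_feat = [1.0, -1.0]              # 2-dim point feature for the same point
img_w, img_b = [[1.0, 0.0, 0.0], [0.0, 0.0, -1.0]], [0.0, 0.0]
pts_w, pts_b = [[1.0, 1.0], [0.0, 1.0]], [0.5, 0.0]

fused = fuse(img_feat, pts_feat, img_w, img_b, pts_w, pts_b)
```

Addition requires both branches to end at the same width, which is why each modality gets its own transform before the sum (640 to 128 for image features and 64 to 128 for point features in the shapes noted above).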

The fused features are fed into sparse convolution. Call: x = self.pts_middle_encoder(voxel_features, feature_coors, batch_size)

See class SparseEncoder:
 def forward(self, voxel_features: Tensor, coors: Tensor,
                batch_size: int) -> Union[Tensor, Tuple[Tensor, list]]:
        """Forward of SparseEncoder.

        Args:
            voxel_features (torch.Tensor): Voxel features in shape (N, C).
            coors (torch.Tensor): Coordinates in shape (N, 4),
                the columns in the order of (batch_idx, z_idx, y_idx, x_idx).
            batch_size (int): Batch size.

        Returns:
            torch.Tensor | tuple[torch.Tensor, list]: Return spatial features
                include:

            - spatial_features (torch.Tensor): Spatial features are out from
                the last layer.
            - encode_features (List[SparseConvTensor], optional): Middle layer
                output features. When self.return_middle_feats is True, the
                module returns middle features.
        """
        # voxel_features.shape torch.Size([11986, 128]) coors.shape torch.Size([11986, 4])
        coors = coors.int()
        input_sp_tensor = SparseConvTensor(voxel_features, coors,
                                           self.sparse_shape, batch_size)  # build a sparse tensor from voxel features, voxel coordinates, spatial shape, and batch size
        x = self.conv_input(input_sp_tensor) # submanifold sparse conv + BN + ReLU

        encode_features = []
        for encoder_layer in self.encoder_layers:
            x = encoder_layer(x)
            encode_features.append(x)

        # for detection head
        # [200, 176, 5] -> [200, 176, 2]
        out = self.conv_out(encode_features[-1])
        spatial_features = out.dense() # torch.Size([1, 128, 2, 200, 150])

        N, C, D, H, W = spatial_features.shape
        spatial_features = spatial_features.view(N, C * D, H, W) # torch.Size([1, 256, 200, 150])

        if self.return_middle_feats:
            return spatial_features, encode_features
        else:
            return spatial_features # torch.Size([1, 256, 200, 150])
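
The `view(N, C * D, H, W)` call folds the depth axis into the channel axis, which is how 128 channels with depth 2 become 256 channels. For a contiguous tensor the merged channel index is `c * D + d`; the nested-list sketch below illustrates that ordering for one sample (the helper name is hypothetical).

```python
def merge_depth_into_channels(dense):
    """dense is a nested list indexed [c][d][h][w] for one sample.
    Returns a list indexed [c * D + d][h][w], matching a row-major view."""
    merged = []
    for channel in dense:
        for depth_slice in channel:
            merged.append(depth_slice)
    return merged


# C = 2 channels, D = 2 depth slices, H = W = 1; values label (c, d) pairs.
dense = [[[[0]], [[1]]],
         [[[2]], [[3]]]]
merged = merge_depth_into_channels(dense)

C, D = 2, 2
c, d = 1, 1
same_cell = merged[c * D + d]  # the slice that came from dense[1][1]
```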

The fused features produced by the sparse convolution are then processed by the SECOND network. Call: x = self.pts_backbone(x)

Class SECOND:
def forward(self, x: Tensor) -> Tuple[Tensor, ...]:
        """Forward function.

        Args:
            x (torch.Tensor): Input with shape (N, C, H, W).

        Returns:
            tuple[torch.Tensor]: Multi-scale features.
        """
        outs = []
        for i in range(len(self.blocks)): 
            x = self.blocks[i](x)
            outs.append(x)
        return tuple(outs) # 2 x 5 2D conv layers; outputs two feature maps: torch.Size([1, 128, 200, 150]) and torch.Size([1, 256, 100, 75])

Next comes the SECONDFPN network. Call: if self.with_pts_neck: x = self.pts_neck(x)

See class SECONDFPN:
def forward(self, x):
        """Forward function.

        Args:
            x (List[torch.Tensor]): Multi-level features with 4D Tensor in
                (N, C, H, W) shape.

        Returns:
            list[torch.Tensor]: Multi-level feature maps.
        """
        assert len(x) == len(self.in_channels)
        ups = [deblock(x[i]) for i, deblock in enumerate(self.deblocks)] # transposed convs bring both feature maps to the same resolution, torch.Size([1, 256, 200, 150])

        if len(ups) > 1:
            out = torch.cat(ups, dim=1)
        else:
            out = ups[0]
        return [out] # torch.Size([1, 512, 200, 150])
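
The shapes in the comments can be checked with the standard transposed-convolution size formula. The sketch below assumes each level is upsampled with kernel size equal to its stride and no padding, with upsample strides `[1, 2]` and 256 output channels per level; these settings are inferred from the shapes above rather than read from a config file.

```python
def deconv_out_size(size, kernel, stride, padding=0):
    """Output length of a transposed convolution
    (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def neck_output_shape(in_shapes, upsample_strides, out_channels):
    """in_shapes: list of (C, H, W). Returns (C, H, W) after upsampling
    every level with kernel == stride and concatenating along channels."""
    spatial = set()
    for (_, h, w), s in zip(in_shapes, upsample_strides):
        spatial.add((deconv_out_size(h, s, s), deconv_out_size(w, s, s)))
    assert len(spatial) == 1, 'levels must share one resolution before concat'
    h, w = next(iter(spatial))
    return sum(out_channels), h, w


shape = neck_output_shape(
    in_shapes=[(128, 200, 150), (256, 100, 75)],
    upsample_strides=[1, 2],
    out_channels=[256, 256],
)
```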

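At this point image feature extraction, point cloud feature extraction, and the fusion of point cloud and image features are complete, giving the two outputs img_feats and pts_feats. Their shapes are: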

img_feats, pts_feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
"""
img_feats[0].shape: ([1, 256, 176, 232])
img_feats[1].shape: ([1, 256, 88, 116])
img_feats[2].shape: ([1, 256, 44, 58])
img_feats[3].shape: ([1, 256, 22, 29])
img_feats[4].shape: ([1, 256, 11, 15])

pts_feats[0].shape: torch.Size([1, 512, 200, 150])
"""

When running forward inference, the following is called:

def predict(self, batch_inputs_dict: Dict[str, Optional[Tensor]],
                batch_data_samples: List[Det3DDataSample],
                **kwargs) -> List[Det3DDataSample]:
        """Forward of testing.

        Args:
            batch_inputs_dict (dict): The model input dict which include
                'points' keys.

                - points (list[torch.Tensor]): Point cloud of each sample.
            batch_data_samples (List[:obj:`Det3DDataSample`]): The Data
                Samples. It usually includes information such as
                `gt_instance_3d`.

        Returns:
            list[:obj:`Det3DDataSample`]: Detection results of the
            input sample. Each Det3DDataSample usually contain
            'pred_instances_3d'. And the ``pred_instances_3d`` usually
            contains following keys.

            - scores_3d (Tensor): Classification scores, has a shape
                (num_instances, )
            - labels_3d (Tensor): Labels of bboxes, has a shape
                (num_instances, ).
            - bbox_3d (:obj:`BaseInstance3DBoxes`): Prediction of bboxes,
                contains a tensor with shape (num_instances, 7).
        """
        batch_input_metas = [item.metainfo for item in batch_data_samples]
        img_feats, pts_feats = self.extract_feat(batch_inputs_dict,
                                                 batch_input_metas)
        if pts_feats and self.with_pts_bbox: # True here: the 3D head runs
            results_list_3d = self.pts_bbox_head.predict(
                pts_feats, batch_data_samples, **kwargs)
        else:
            results_list_3d = None

        if img_feats and self.with_img_bbox: # False here: no image head
            # TODO check this for camera modality
            results_list_2d = self.predict_imgs(img_feats, batch_data_samples,
                                                **kwargs)
        else:
            results_list_2d = None

        detsamples = self.add_pred_to_datasample(batch_data_samples,
                                                 results_list_3d,
                                                 results_list_2d)
        return detsamples

The point cloud features go into the pts_bbox head. Call: if pts_feats and self.with_pts_bbox: results_list_3d = self.pts_bbox_head.predict(pts_feats, batch_data_samples, **kwargs)

See class Anchor3DHead:
def predict(self,
                x: Tuple[Tensor],
                batch_data_samples: SampleList,
                rescale: bool = False) -> InstanceList:
        """Perform forward propagation of the 3D detection head and predict
        detection results on the features of the upstream network.

        Args:
            x (tuple[Tensor]): Multi-level features from the
                upstream network, each is a 4D-tensor.
            batch_data_samples (List[:obj:`Det3DDataSample`]): The Data
                Samples. It usually includes information such as
                `gt_instance_3d`, `gt_pts_panoptic_seg` and
                `gt_pts_sem_seg`.
            rescale (bool, optional): Whether to rescale the results.
                Defaults to False.

        Returns:
            list[:obj:`InstanceData`]: Detection results of each sample
            after the post process.
            Each item usually contains following keys.

            - scores_3d (Tensor): Classification scores, has a shape
              (num_instances, )
            - labels_3d (Tensor): Labels of bboxes, has a shape
              (num_instances, ).
            - bboxes_3d (BaseInstance3DBoxes): Prediction of bboxes,
              contains a tensor with shape (num_instances, C), where
              C >= 7.
        """
        batch_input_metas = [
            data_samples.metainfo for data_samples in batch_data_samples
        ]
        outs = self(x) # return multi_apply(self.forward_single, x)->return tuple(map(list, zip(*map_results)))
        # the return value is ([cls_score], [bbox_pred], [dir_cls_pred])
        predictions = self.predict_by_feat(
            *outs, batch_input_metas=batch_input_metas, rescale=rescale) # rescale = False; a lot of post-processing (anchor generation etc.) that needs a closer look later
        return predictions
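
The comment on `outs = self(x)` quotes the regrouping expression `tuple(map(list, zip(*map_results)))`. The sketch below reproduces that pattern in isolation; `forward_single` here is a hypothetical stand-in that returns strings instead of tensors.

```python
from functools import partial


def multi_apply(func, *args, **kwargs):
    """Apply func to each per-level input and regroup the outputs by kind."""
    pfunc = partial(func, **kwargs) if kwargs else func
    map_results = map(pfunc, *args)
    return tuple(map(list, zip(*map_results)))


def forward_single(feat):
    # Hypothetical head: three outputs per feature level.
    return f'cls({feat})', f'bbox({feat})', f'dir({feat})'


cls_scores, bbox_preds, dir_preds = multi_apply(forward_single, ['lvl0', 'lvl1'])
```

Each returned list holds one entry per feature level. With the single fused feature map in this walkthrough, each list has length 1, which matches the `([cls_score], [bbox_pred], [dir_cls_pred])` form noted above.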

Image features into the image head: the source code has no image head.
Finally the results are assembled. Call: detsamples = self.add_pred_to_datasample(batch_data_samples, results_list_3d, results_list_2d)

def add_pred_to_datasample(
        self,
        data_samples: SampleList,
        data_instances_3d: OptInstanceList = None,
        data_instances_2d: OptInstanceList = None,
    ) -> SampleList:
        """Convert results list to `Det3DDataSample`.

        Subclasses could override it to be compatible for some multi-modality
        3D detectors.

        Args:
            data_samples (list[:obj:`Det3DDataSample`]): The input data.
            data_instances_3d (list[:obj:`InstanceData`], optional): 3D
                Detection results of each sample.
            data_instances_2d (list[:obj:`InstanceData`], optional): 2D
                Detection results of each sample.

        Returns:
            list[:obj:`Det3DDataSample`]: Detection results of the
            input. Each Det3DDataSample usually contains
            'pred_instances_3d'. And the ``pred_instances_3d`` normally
            contains following keys.

            - scores_3d (Tensor): Classification scores, has a shape
              (num_instance, )
            - labels_3d (Tensor): Labels of 3D bboxes, has a shape
              (num_instances, ).
            - bboxes_3d (Tensor): Contains a tensor with shape
              (num_instances, C) where C >=7.

            When there are image prediction in some models, it should
            contains  `pred_instances`, And the ``pred_instances`` normally
            contains following keys.

            - scores (Tensor): Classification scores of image, has a shape
              (num_instance, )
            - labels (Tensor): Predict Labels of 2D bboxes, has a shape
              (num_instances, ).
            - bboxes (Tensor): Contains a tensor with shape
              (num_instances, 4).
        """

        assert (data_instances_2d is not None) or \
               (data_instances_3d is not None),\
               'please pass at least one type of data_samples'

        if data_instances_2d is None: # fill in empty placeholders
            data_instances_2d = [
                InstanceData() for _ in range(len(data_instances_3d))
            ]
        if data_instances_3d is None:
            data_instances_3d = [
                InstanceData() for _ in range(len(data_instances_2d))
            ]

        for i, data_sample in enumerate(data_samples):
            data_sample.pred_instances_3d = data_instances_3d[i]
            data_sample.pred_instances = data_instances_2d[i]
        return data_samples
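
The placeholder logic can be tried out without mmdetection3d by using plain dicts in place of `Det3DDataSample` and `InstanceData`. This is an illustrative analogue only; `attach_predictions` is a hypothetical name.

```python
def attach_predictions(samples, preds_3d=None, preds_2d=None):
    """Attach 3D and/or 2D predictions to each sample dict, filling the
    missing modality with empty placeholders."""
    assert preds_3d is not None or preds_2d is not None, \
        'please pass at least one type of predictions'
    if preds_2d is None:
        preds_2d = [{} for _ in preds_3d]
    if preds_3d is None:
        preds_3d = [{} for _ in preds_2d]
    for sample, p3d, p2d in zip(samples, preds_3d, preds_2d):
        sample['pred_instances_3d'] = p3d
        sample['pred_instances'] = p2d
    return samples


samples = [{'id': 0}, {'id': 1}]
preds_3d = [{'scores_3d': [0.9]}, {'scores_3d': []}]
out = attach_predictions(samples, preds_3d=preds_3d)
```

In this walkthrough only the 3D results exist, so every sample ends up with an empty 2D entry, which matches the case where `results_list_2d` is None.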