[Paper Reading]M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Latest recommended article published on 2024-09-21 00:06:25

路飞DoD

Tags: transformer, deep learning, artificial intelligence

Article link: https://blog.csdn.net/qq_51022848/article/details/136556836

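M2T2 is a Transformer model that processes 3D point clouds, adapts to different objects in complex scenes, and performs 6-DoF grasping and placing. It uses the raw point cloud of the scene to predict contact points and valid gripper poses, markedly improving zero-shot performance on unseen objects. With multi-layer attention and a masked Transformer, combined with a large-scale synthetic dataset, it offers a general and flexible robot manipulation policy.

Summary generated automatically by CSDN.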

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

[home page][paper]

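With the emergence of LLMs and large-scale robot datasets, high-level decision-making for object manipulation has made great progress. These general-purpose models interpret complex tasks from natural-language commands, but because their low-level action primitives are limited, they struggle to generalize to out-of-distribution objects (weak generalization).

Existing task-specific models excel at low-level manipulation of unseen objects, but each one handles only a single type of action.

M2T2 provides different types of low-level actions that work on arbitrary objects in cluttered scenes. It is a transformer model that, given the raw point cloud of a scene, reasons about contact points and predicts valid gripper poses for different action modes.

End result: an end-to-end model that copes with an unknown number of objects across different scenes and, after several inference passes, provides a set of feasible grasp or placement proposals.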

Introduction

Language Models for high-level planning.

SayCan: Grounding Language in Robotic Affordances
task-specific model

Goal: design a unified model that can use different action primitives and also applies to a wide variety of objects.

actions:

6-DoF grasping
placing

It can supply a low-level motion planner with a diverse set of goal poses.

Contribution

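It outperforms existing state-of-the-art methods (Contact-GraspNet and CabiNet) in both success rate and output diversity;
A large-scale synthetic dataset is provided for training, containing 130,000 cluttered scenes with 8,800 distinct objects, annotated with valid gripper poses for picking and placing;
In zero-shot pick and place of out-of-distribution objects, M2T2 improves on the baseline by about 19%;
We show that M2T2 outperforms the state-of-the-art end-to-end method [A multi-task transformer for robotic manipulation] on a subset of RLBench [paper], demonstrating its potential for solving complex tasks with language goals.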

Action Modes

Object-centric 6-DoF Grasping

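Input: 3D point cloud of the scene

Output: a set of grasp proposals per object (6-DoF grasp poses [3-DoF rotation + 3-DoF translation], i.e. end-effector poses)

Orientation-aware Placing

Input: 3D point cloud of the scene plus the partial 3D point cloud of the object to be placed

Output: a set of 6-DoF placement poses indicating where the end effector needs to be so that, when the object is released, it rests stably without collision

Key idea: reason about contact points. Picking is treated as the robot making contact with the target object using an empty gripper; placing is treated as the robot making contact with a surface using the object held in its gripper.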

Model Architecture

Scene Encoder

PointNet++

Multi-scale feature maps: 1/64, 1/16, 1/4 and 1x (relative to the input size)

Contact Decoder

A transformer model that predicts the locations of contact points for grasping and placing.

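For grasping, we use the grasp representation from [Contact-GraspNet], in which each grasp is anchored at a visible point on the object that the gripper touches during the grasp, and the model predicts additional parameters describing the transform of the grasp relative to that contact point. We extend this representation to placing by defining the contact point as the location where the center of the object's point cloud projects onto the table.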
[Figure: grasp representation]
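Since we can predict whether an observed point is a suitable grasp contact, the 6-DoF grasp learning problem reduces to estimating the 3-DoF grasp rotation $R_g \in \mathbb{R}^{3 \times 3}$ and the grasp width $w \in \mathbb{R}$ of a parallel-jaw gripper.

Starting from a contact point $c \in \mathbb{R}^3$, where the gripper baseline intersects the mesh, a 6-DoF grasp pose $g \in G$ is described by $(R_g, t_g) \in SE(3)$ and the grasp width $w \in \mathbb{R}$ as follows:
$t_g = c + \frac{w}{2}b + da \quad (1)$
$R_g = \left[ \begin{array}{ccc} | & | & | \\ b & a \times b & a \\ | & | & | \end{array} \right] \quad (2)$
where $a \in \mathbb{R}^3$, $\|a\| = 1$ is the approach vector, $b \in \mathbb{R}^3$, $\|b\| = 1$ is the grasp baseline vector, and $d \in \mathbb{R}$ is the constant distance from the gripper baseline to the gripper base. The grasp representation is shown in the figure above.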
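
To make Eq. (1) and (2) concrete, here is a minimal NumPy sketch that assembles a 4x4 gripper pose from a contact point, an approach vector, a baseline vector, a grasp width and the baseline-to-base distance. The function name `grasp_pose_from_contact` and the example values are hypothetical, and the re-orthogonalization of the approach vector is an assumption of this sketch rather than something stated in the notes.

```python
import numpy as np

def grasp_pose_from_contact(c, a, b, w, d):
    """Assemble a 4x4 pose from Eq. (1)-(2).

    c: contact point (3,), a: approach vector (3,), b: baseline vector (3,),
    w: grasp width, d: distance from gripper baseline to gripper base.
    """
    c = np.asarray(c, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize b, then remove the component of a along b so that the
    # columns of R_g are orthonormal (assumption of this sketch).
    b = b / np.linalg.norm(b)
    a = a - np.dot(a, b) * b
    a = a / np.linalg.norm(a)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([b, np.cross(a, b), a], axis=1)  # Eq. (2)
    pose[:3, 3] = c + 0.5 * w * b + d * a                    # Eq. (1)
    return pose

# Hypothetical example: top-down approach, baseline along the x axis.
pose = grasp_pose_from_contact(
    c=[0.1, 0.0, 0.2], a=[0.0, 0.0, -1.0], b=[1.0, 0.0, 0.0], w=0.08, d=0.1
)
```

Because the rotation columns are $b$, $a \times b$ and $a$, the result is a proper rotation whenever $a$ and $b$ are orthogonal unit vectors.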

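We can therefore borrow recent insights from image segmentation. In our case, we modify [Masked Transformer] to predict contact masks. The transformer passes a set of learnable query tokens through multiple attention layers. Multi-resolution feature maps from the scene encoder are fed in through cross-attention at different layers. The output tokens of each layer are multiplied with the per-point feature map of the scene encoder to produce intermediate masks.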
$X_l = \text{softmax}(\mathcal{M}_{l-1} + Q_lK_l^T)V_l + X_{l-1}$

$\mathcal{M}_{l-1}(t, x, y) = \begin{cases} 0, & \text{if}\quad M_{l-1}(t, x, y) = 1 \\ -\infty, & \text{otherwise} \end{cases}$

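These intermediate masks are used to mask the cross-attention in the next layer, steering attention toward the relevant regions (hence the name "masked Transformer"). After the last attention layer, the model produces G grasp masks and P placement masks, where G is the maximum number of graspable objects and P is the number of placement orientations.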
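
The masked cross-attention update above can be sketched in a few lines of NumPy. This is an illustrative single-head version that follows the formula literally (no scaling, no learned projections); the function name `masked_cross_attention` and the shapes (T queries, P points, C channels) are assumptions of this sketch. The guard for queries whose previous mask is empty is also an assumption, added so that no softmax row is undefined.

```python
import numpy as np

def masked_cross_attention(x_prev, q, k, v, mask_prev):
    """One update X_l = softmax(M_{l-1} + Q K^T) V + X_{l-1}.

    x_prev: (T, C) query features from the previous layer
    q: (T, C), k: (P, C), v: (P, C) queries, keys and values
    mask_prev: (T, P) binary mask predicted by the previous layer
    """
    keep = mask_prev > 0
    # Additive mask: 0 where the previous mask is 1, -inf elsewhere.
    bias = np.where(keep, 0.0, -np.inf)
    # Guard (assumption): a query with an empty mask attends to all points.
    bias[~keep.any(axis=1)] = 0.0
    logits = q @ k.T + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v + x_prev
```

With a mask that keeps a single point, the softmax puts all weight on that point, so the query simply receives that point's value vector plus its residual.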

Objectness MLP

Object Encoder

Action Decoder

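The action decoder is a 3-layer MLP that takes the per-point feature map from the scene encoder and predicts, for every point, a 3D approach direction, a 3D contact direction and a 1D grasp width. Together with the contact points, these predictions are used to reconstruct the grasp poses.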

Loss

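The number of objects $N$ in the scene is unknown, so a large number $G$ of grasp tokens is used. M2T2 outputs $G$ scalar objectness scores $o_i$ and $G$ per-point masks $M_i^{\text{grasp}}$.

We use Hungarian matching to select the $N$ masks that best match the ground truth.

First, we compute the cost between each prediction $(o_i, M_i^{\text{grasp}})$ and each ground-truth mask $M_j^{\text{gt}}$ (the predicted mask is written $M_i^{\text{pred}}$ in the formulas below):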

$C_{ij} = 1 - o_i + BCE(M_i^{\text{pred}}, M_j^{\text{gt}}) + DICE(M_i^{\text{pred}}, M_j^{\text{gt}})$

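where:

$o_i$ is the objectness score, i.e. the predicted probability that token $i$ corresponds to an object.
BCE denotes the binary cross-entropy loss.
DICE denotes the Dice loss (the Dice coefficient measures the similarity of two samples and takes values between 0 and 1).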

$L_{DICE}(Y_{\text{true}}, Y_{\text{pred}}) = 1-\frac{2 \times |Y_{\text{true}} \cap Y_{\text{pred}}|}{|Y_{\text{true}}| + |Y_{\text{pred}}|}$
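
As a rough illustration of the matching cost, the NumPy sketch below builds the $G \times N$ cost matrix from objectness scores, predicted mask probabilities and binary ground-truth masks. The function names are hypothetical, and two choices are assumptions of this sketch: the Dice term uses the soft (product-based) overlap with a small `eps` for stability, and BCE is averaged over points.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between mask probabilities and a binary mask."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p))))

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), with eps to avoid 0/0."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def matching_cost(objectness, pred_masks, gt_masks):
    """Return the (G, N) matrix C_ij = 1 - o_i + BCE + DICE.

    objectness: (G,), pred_masks: (G, P) probabilities, gt_masks: (N, P) binary.
    """
    num_pred, num_gt = len(pred_masks), len(gt_masks)
    cost = np.zeros((num_pred, num_gt))
    for i in range(num_pred):
        for j in range(num_gt):
            cost[i, j] = (
                1.0 - objectness[i]
                + bce(pred_masks[i], gt_masks[j])
                + dice_loss(pred_masks[i], gt_masks[j])
            )
    return cost
```

A prediction that overlaps a ground-truth mask perfectly gets a Dice term near 0, while a disjoint prediction gets a Dice term near 1, so the cost ranks candidates as expected.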

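Next, we apply Hungarian matching to the $G \times N$ cost matrix $C$ to obtain the index set $\mathcal{M} = \{m_j\}_{j=1}^{N}$ that minimizes the total cost $\sum_{j=1}^{N} C_{m_j j}$. We then compute the objectness loss by labeling all matched tokens as positive and all other tokens as negative.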

$L_{\text{obj}} = \frac{1}{G} \sum_{i=1}^{G} - \left[ \mathbb{1}(i \in \mathcal{M}) \log(o_i) + (1 - \mathbb{1}(i \in \mathcal{M})) \log(1 - o_i) \right]$

We compute the mask loss between the matched masks and the ground truth as follows:
$L_{\text{mask}} = \frac{1}{N} \sum_{j=1}^{N} \left[ \text{BCE}(M_{m_j}^{pred}, M_{j}^{gt}) + \text{DICE}(M_{m_j}^{pred}, M_{j}^{gt}) \right]$
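
The matching step and the objectness loss can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem as Hungarian matching. The function name `match_and_objectness_loss` and the toy numbers are hypothetical; the sketch assumes $G \ge N$ and clips the scores to avoid `log(0)`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_objectness_loss(cost, objectness, eps=1e-7):
    """Match tokens to ground-truth masks and compute L_obj.

    cost: (G, N) matching cost with G >= N, objectness: (G,) scores in (0, 1).
    Returns m (N,), where m[j] is the token matched to ground truth j, and L_obj.
    """
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    m = np.empty(cost.shape[1], dtype=np.int64)
    m[cols] = rows
    positive = np.zeros(cost.shape[0])
    positive[m] = 1.0  # matched tokens are positives, the rest negatives
    o = np.clip(objectness, eps, 1.0 - eps)
    loss = float(np.mean(-(positive * np.log(o) + (1 - positive) * np.log(1 - o))))
    return m, loss

# Toy example with G = 3 tokens and N = 2 ground-truth masks.
cost = np.array([[0.1, 2.0], [2.0, 0.2], [1.5, 1.5]])
objectness = np.array([0.9, 0.8, 0.1])
m, loss = match_and_objectness_loss(cost, objectness)
```

The mask loss is then the average of BCE plus Dice over the pairs `(m[j], j)`.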

Code

config

config.yaml
	- data
	- m2t2
		- scene_encoder
		- object_encoder
		- contact_decoder
		- action_decoder
		- matcher
		- grasp_loss
		- place_loss
	- optimizer
	- train
	- eval

M2T2

def init():
	backbone = PointNet2MSG.from_config(cfg.scene_encoder)
	object_encoder = PointNet2MSGCls.from_config(cfg.object_encoder)
	transformer = ContactDecoder.from_config(cfg.contact_decoder, channels, obj_channels)
	grasp_mlp = ActionDecoder.from_config(cfg.action_decoder, args['transformer'])
	set_criterion = SetCriterion.from_config(cfg.grasp_loss, matcher)
	grasp_criterion = GraspCriterion.from_config(cfg.grasp_loss)
	place_criterion = PlaceCriterion.from_config(cfg.place_loss)

def forward():
    scene_inputs[B, N, 3+input_channels] -> Scene_Encoder() -> scene_outputs (multi-scale feature maps) [B, output_channels[i], N[i]]
    ...

Scene Encoder

pointnet2

INPUT():
    pointcloud: Variable(torch.cuda.FloatTensor)
    shape: [B, N, 3 + input_channels]
    type: tensor
RESHAPE():
    xyz: [B, N, 3]
    features: [B, input_channels, N]
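
The reshape step amounts to splitting the last axis into coordinates and extra channels, then moving the channel axis in front of the point axis. The sketch below uses NumPy as a stand-in for torch so that it runs without a GPU; the function name `break_up_pointcloud` is hypothetical.

```python
import numpy as np

def break_up_pointcloud(pointcloud):
    """Split (B, N, 3 + C) into xyz (B, N, 3) and features (B, C, N) or None."""
    xyz = np.ascontiguousarray(pointcloud[..., :3])
    if pointcloud.shape[-1] > 3:
        # Channels-first layout, as expected by the shared MLPs.
        features = np.ascontiguousarray(np.swapaxes(pointcloud[..., 3:], 1, 2))
    else:
        features = None
    return xyz, features

# Hypothetical batch: 2 clouds, 1024 points, xyz + rgb.
xyz, features = break_up_pointcloud(np.zeros((2, 1024, 6)))
```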

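Run the SA modules in sequence: each one samples local structures of the point cloud, so the receptive field grows step by step while point features are extracted.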

l_xyz, l_features, sample_ids = [xyz], [features], []
for i in range(len(self.SA_modules)):
    li_xyz, _, li_features, sample_idx = self.SA_modules[i](
        l_xyz[i], l_features[i]  # input data (xyz, features) or the xyz, features output by the previous SA level
    )
    l_xyz.append(li_xyz)
    l_features.append(li_features)
    if sample_idx[0] is not None:  # check whether sampled point indices exist
        sample_ids.append(sample_idx[0])

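Upsampling layers (FP modules): they propagate the extracted features back to the denser point sets, which is what per-point prediction such as object segmentation relies on.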

for i in range(-1, -(len(self.FP_modules) + 1), -1):  # iterate in reverse order
    l_features[i - 1] = self.FP_modules[i](
        l_xyz[i - 1], l_xyz[i], l_features[i - 1], l_features[i]  # inputs from the shallower level (i - 1) and the current level (i)
    )
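
The negative indexing in this loop is easy to misread, so here is a small pure-Python trace of which feature level each FP module overwrites and which deeper level it reads from. The helper name `fp_update_order` is hypothetical, and the sketch assumes there is one FP module per SA module, so the feature lists have one more entry than there are FP modules.

```python
def fp_update_order(num_fp_modules):
    """Return (module, target_level, source_level) in the order the loop runs."""
    num_levels = num_fp_modules + 1  # input level plus one level per SA module
    order = []
    for i in range(-1, -(num_fp_modules + 1), -1):
        module = num_fp_modules + i      # positive index of FP_modules[i]
        target = num_levels + (i - 1)    # positive index of l_features[i - 1]
        source = num_levels + i          # positive index of l_features[i]
        order.append((module, target, source))
    return order

print(fp_update_order(4))
# [(3, 3, 4), (2, 2, 3), (1, 1, 2), (0, 0, 1)]
```

So the deepest features are propagated first, and the last iteration writes the full-resolution features at level 0.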

SA_modules implementation

Base class:

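from typing import Optional, Tuple

import torch
import torch.nn as nn

# furthest_point_sample, gather_operation, QueryAndGroup, GroupAll and
# build_shared_mlp are assumed to come from the PointNet++ ops this file builds on.
class _PointnetSAModuleBase(nn.Module):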
    def __init__(self):
        super(_PointnetSAModuleBase, self).__init__()
        self.npoint = None
        self.groupers = None
        self.mlps = None

    def forward(
        self, xyz: torch.Tensor, features: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, list]:
        r"""
        Parameters
        ----------
        xyz : torch.Tensor
            (B, N, 3) tensor of the xyz coordinates of the features
        features : torch.Tensor
            (B, C, N) tensor of the descriptors of the features

        Returns
        -------
        new_xyz : torch.Tensor
            (B, npoint, 3) tensor of the new features' xyz
        new_xyz_idx : torch.Tensor
            (B, npoint) indices of the sampled points in xyz
        new_features : torch.Tensor
            (B,  \sum_k(mlps[k][-1]), npoint) tensor of the new_features descriptors
        sample_ids : list of torch.Tensor
            list of (B, npoint, nsample) points indices from ball queries
        """
        if self.npoint is not None:
            new_xyz_idx = furthest_point_sample(xyz, self.npoint)
            new_xyz = (
                gather_operation(
                    xyz.transpose(1, 2).contiguous(), new_xyz_idx
                ).transpose(1, 2).contiguous()
            )
        else:
            new_xyz_idx = torch.zeros_like(xyz[:, :1, 0]).long()
            new_xyz = torch.zeros_like(xyz[:, :1])

        new_features_list, sample_ids = [], []
        for i in range(len(self.groupers)):
            new_features, sample_idx = self.groupers[i](
                xyz, new_xyz, features
            )  # (B, C, npoint, nsample)

            new_features = self.mlps[i](new_features)  # (B, mlp[-1], npoint, nsample)
            new_features = new_features.max(dim=-1)[0]  # (B, mlp[-1], npoint)

            new_features_list.append(new_features)
            sample_ids.append(sample_idx)
        features = torch.cat(new_features_list, dim=1)

        return new_xyz, new_xyz_idx, features, sample_ids

class PointnetSAModuleMSG(_PointnetSAModuleBase):
    r"""Pointnet set abstraction layer with multiscale grouping

    Parameters
    ----------
    npoint : int
        Number of points to sample (group centers)
    radii : list of float32
        list of radii to group with
    nsamples : list of int32
        Number of samples in each ball query
    mlps : list of list of int32
        Spec of the pointnet before the global max_pool for each scale
    norm : str
        Type of normalization layer (BN/GN)
    """

    def __init__(self, npoint, radii, nsamples, mlps, norm='BN', use_xyz=True):
        super(PointnetSAModuleMSG, self).__init__()

        assert len(radii) == len(nsamples) == len(mlps)

        self.npoint = npoint
        self.groupers = nn.ModuleList()
        self.mlps = nn.ModuleList()
        for i in range(len(radii)):
            radius = radii[i]
            nsample = nsamples[i]
            self.groupers.append(
                QueryAndGroup(radius, nsample, use_xyz=use_xyz)
                if npoint is not None
                else GroupAll(use_xyz)
            )
            mlp_spec = mlps[i]
            if use_xyz:
                mlp_spec[0] += 3

            self.mlps.append(build_shared_mlp(mlp_spec, norm))
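
`furthest_point_sample` in the base class is a compiled op, so its logic is not visible in the excerpt. The NumPy sketch below illustrates the general furthest point sampling idea for a single cloud: repeatedly pick the point that is farthest from everything chosen so far. The function name `furthest_point_sample_np` is hypothetical, and starting from index 0 is an assumption of this sketch.

```python
import numpy as np

def furthest_point_sample_np(xyz, npoint):
    """Greedy furthest point sampling on one cloud.

    xyz: (N, 3) array of points. Returns (npoint,) indices into xyz.
    """
    num_points = xyz.shape[0]
    idx = np.zeros(npoint, dtype=np.int64)
    # Squared distance from every point to its nearest already-selected point.
    min_dist = np.full(num_points, np.inf)
    farthest = 0  # assumption: the first sample is point 0
    for i in range(npoint):
        idx[i] = farthest
        dist = np.sum((xyz - xyz[farthest]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        farthest = int(np.argmax(min_dist))
    return idx

# Four points on a line: the samples spread out instead of clustering.
line = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0]])
picked = furthest_point_sample_np(line, 3)
```

Each selected point then becomes the center of the ball queries performed by the groupers.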

Implementation:

self.use_rgb = use_rgb
c_in = 3 if use_rgb else 0
num_points = num_points // downsample
self.SA_modules.append(
    PointnetSAModuleMSG(
        npoint=num_points,
        radii=[radius, radius * radius_mult],
        nsamples=[16, 32],
        mlps=[[c_in, 32, 32, 64], [c_in, 32, 32, 64]],
        norm=norm
    )
)
c_out_0 = 64 + 64
radius = radius * radius_mult

num_points = num_points // downsample
self.SA_modules.append(
    PointnetSAModuleMSG(
        npoint=num_points,
        radii=[radius, radius * radius_mult],
        nsamples=[16, 32],
        mlps=[[c_out_0, 64, 64, 128], [c_out_0, 64, 64, 128]],
        norm=norm
    )
)
c_out_1 = 128 + 128
radius = radius * radius_mult

num_points = num_points // downsample
self.SA_modules.append(
    PointnetSAModuleMSG(
        npoint=num_points,
        radii=[radius, radius * radius_mult],
        nsamples=[16, 32],
        mlps=[[c_out_1, 128, 128, 256], [c_out_1, 128, 128, 256]],
        norm=norm
    )
)
c_out_2 = 256 + 256
radius = radius * radius_mult

num_points = num_points // downsample
self.SA_modules.append(
    PointnetSAModuleMSG(
        npoint=num_points,
        radii=[radius, radius * radius_mult],
        nsamples=[16, 32],
        mlps=[[c_out_2, 256, 256, 512], [c_out_2, 256, 256, 512]],
        norm=norm
    )
)
c_out_3 = 512 + 512
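
To see how the four SA levels shrink the point set and widen the features, the pure-Python sketch below repeats the arithmetic of the constructor. Each level divides the point count by `downsample` and concatenates the outputs of its two grouping scales. The helper name `sa_level_summary` and the example values `num_points=16384`, `downsample=4` are hypothetical, not taken from the repository config.

```python
def sa_level_summary(num_points, downsample, widths=(64, 128, 256, 512)):
    """Return (npoint, out_channels) for each SA level with two-scale grouping."""
    levels = []
    for width in widths:
        num_points = num_points // downsample
        levels.append((num_points, 2 * width))  # two scales are concatenated
    return levels

print(sa_level_summary(num_points=16384, downsample=4))
# [(4096, 128), (1024, 256), (256, 512), (64, 1024)]
```

The channel counts match `c_out_0` to `c_out_3` above.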

FP_modules implementation

self.FP_modules.append(
    PointnetFPModule(mlp=[256 + c_in, 128, 128])
)
self.FP_modules.append(
    PointnetFPModule(mlp=[512 + c_out_0, 256, 256])
)
self.FP_modules.append(
    PointnetFPModule(mlp=[512 + c_out_1, 512, 512])
)
self.FP_modules.append(
    PointnetFPModule(mlp=[c_out_3 + c_out_2, 512, 512])
)

self.out_channels = {
    'res0': 128, 'res1': 256, 'res2': 512, 'res3': 512, 'res4': 1024
}
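
The FP input widths can be checked by hand: each FP module receives the skip features of its own level concatenated with the features propagated down from the deeper level. The pure-Python sketch below walks the levels from deep to shallow and verifies that chain against the `mlp` specs above. The helper name `check_fp_channels` is hypothetical, and `c_in=3` in the example assumes `use_rgb` is enabled.

```python
def check_fp_channels(c_in):
    """Verify the FP input widths and return the per-level output widths."""
    c_out = [128, 256, 512, 1024]   # c_out_0..c_out_3 from the SA modules
    skip = [c_in] + c_out[:3]       # channels of l_features[0..3]
    fp_mlps = [
        [256 + c_in, 128, 128],
        [512 + c_out[0], 256, 256],
        [512 + c_out[1], 512, 512],
        [c_out[3] + c_out[2], 512, 512],
    ]
    upstream = c_out[3]  # the deepest level feeds the last FP module
    outputs = {'res4': c_out[3]}
    for level in range(3, -1, -1):
        expected_in = skip[level] + upstream
        assert fp_mlps[level][0] == expected_in, (level, expected_in)
        upstream = fp_mlps[level][-1]
        outputs[f'res{level}'] = upstream
    return outputs

channels = check_fp_channels(c_in=3)
```

The returned widths agree with `self.out_channels`, with `res4` being the raw output of the last SA module.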