Robot Arm Paper Notes (4) [2020 Grasp Detection] Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network

Abstract

Abstract— In this paper, we present a modular robotic system to tackle the problem of generating and performing antipodal robotic grasps for unknown objects from the n-channel image of the scene. We propose a novel Generative Residual Convolutional Neural Network (GR-ConvNet) model that can generate robust antipodal grasps from n-channel input at real-time speeds (∼20ms). We evaluate the proposed model architecture on standard datasets and a diverse set of household objects. We achieved state-of-the-art accuracy of 97.7% and 94.6% on Cornell and Jacquard grasping datasets, respectively. We also demonstrate a grasp success rate of 95.4% and 93% on household and adversarial objects, respectively, using a 7 DoF robotic arm.



I. INTRODUCTION

Robotic manipulators are constantly compared to humans because humans can instinctively grasp an unknown object rapidly and with ease based on their own experience. As increasing research is done to make robots more intelligent, there is a demand for a generalized technique to infer fast and robust grasps for any kind of object that the robot encounters. The major challenge is being able to precisely transfer the knowledge that the robot learns to novel real-world objects.


We present a modular robot agnostic approach to tackle this problem of grasping unknown objects. We propose a Generative Residual Convolutional Neural Network (GR ConvNet) that generates antipodal grasps for every pixel in an n-channel input image. We use the term generative to distinguish our method from other techniques that output a grasp probability or classify grasp candidates in order to predict the best grasp.


Unlike previous work in robotic grasping [1], [2], [3], [4], where the required grasp is predicted as a grasp rectangle chosen from multiple grasp probabilities, our network generates three images from which we can infer grasp rectangles for multiple objects. Additionally, it is possible to infer multiple grasp rectangles for multiple objects from the output of GR-ConvNet in one shot, thereby decreasing the overall computational time.


Fig. 1 shows an overview of the proposed system architecture. It consists of two main modules: the inference module and the control module. The inference module acquires RGB and aligned depth images of the scene from the RGB-D camera. The images are pre-processed to match the input format of the GR-ConvNet. The network generates quality, angle, and width images, which are then used to infer antipodal grasp poses. The control module consists of a task controller that prepares and executes a plan to perform a pick-and-place task using the grasp pose generated by the inference module. It communicates the required actions to the robot through a ROS interface using a trajectory planner and controller.



Fig. 1: Proposed system overview. The inference module predicts suitable grasp poses for the objects in the camera’s field of view. The control module uses these grasp poses to plan and execute robot trajectories that perform antipodal grasps. Video: https://youtu.be/cwlEhdoxY4U


The main contributions of this paper can be summarized as follows:

1) We present a modular robotic system that predicts, plans, and performs antipodal grasps for the objects in the scene. We open-sourced the implementation of the proposed inference and control modules.

2) We propose a novel generative residual convolutional neural network architecture that predicts suitable antipodal grasp configurations for objects in the camera’s field of view.

3) We evaluate our model on publicly available grasping datasets and achieve state-of-the-art accuracy of 97.7% and 94.6% on the Cornell and Jacquard grasping datasets, respectively.

4) We demonstrate that the proposed model can be deployed on a robotic arm to perform antipodal grasps at real-time speeds, with success rates of 95.4% and 93% on household and adversarial objects, respectively.


IV. APPROACH

We propose a dual-module system to predict, plan and perform antipodal grasps for the objects in the scene. The overview of the proposed system is shown in fig.1. The inference module is used to predict suitable grasp poses for the objects in the camera’s field of view. The control module uses these grasp poses to plan and execute robot trajectories to perform antipodal grasps.


A. Inference module

The inference module consists of three parts. First, the input data is pre-processed: it is cropped, resized, and normalized, and if the input includes a depth image, the depth is inpainted to obtain a dense depth representation [30]. The processed 224×224 n-channel input image is fed into the GR-ConvNet. Because the network accepts n-channel input, it is not limited to a particular input modality such as a depth-only or RGB-only image, which generalizes it to any kind of input modality. Second, GR-ConvNet extracts features from the pre-processed image and generates three output images: the grasp quality score, grasp angle, and grasp width. Third, grasp poses are inferred from these three output images.

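The three parts described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' released implementation; the helper names (`preprocess`, `best_grasp`) and the exact crop and normalization choices are assumptions:

```python
import numpy as np

def preprocess(rgb, depth, out_size=224):
    """Crop, resize, and normalize an RGB-D pair into a 224x224
    n-channel input (here n = 4: three RGB channels plus depth)."""
    h, w = rgb.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2   # center square crop
    rgb = rgb[top:top + s, left:left + s]
    depth = depth[top:top + s, left:left + s]
    idx = np.linspace(0, s - 1, out_size).astype(int)  # nearest-neighbour resize
    rgb = rgb[np.ix_(idx, idx)].astype(np.float32) / 255.0 - 0.5
    depth = depth[np.ix_(idx, idx)].astype(np.float32)
    depth -= depth.mean()                    # zero-mean depth
    return np.dstack([rgb, depth])           # (224, 224, 4)

def best_grasp(quality, angle, width):
    """Read the grasp at the pixel with the highest quality score
    from the three output images."""
    y, x = np.unravel_index(np.argmax(quality), quality.shape)
    return (x, y), angle[y, x], width[y, x]
```

In the real system the depth image would first be inpainted (e.g. with OpenCV's `cv2.inpaint`) before stacking, and the network itself sits between these two helpers.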

B. Control module

The control module mainly incorporates a task controller that performs tasks such as pick-and-place and calibration. The controller requests a grasp pose from the inference module, which returns the grasp pose with the highest quality score. The grasp pose is then converted from camera coordinates into robot coordinates using the transform calculated from hand-eye calibration [31]. Further, the grasp pose in the robot frame is used to plan a trajectory to perform the pick-and-place action using inverse kinematics through a ROS interface. The robot then executes the planned trajectory. Due to our modular approach and ROS integration, this system can be adapted for any robotic arm.

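As a sketch of the coordinate conversion step, assuming hand-eye calibration yields a 4×4 homogeneous transform `T_robot_cam` from the camera frame to the robot base frame (the variable name is ours, not from the paper):

```python
import numpy as np

def camera_to_robot(grasp_cam_xyz, T_robot_cam):
    """Map a grasp position from camera coordinates to robot base
    coordinates using a 4x4 homogeneous transform."""
    p = np.append(np.asarray(grasp_cam_xyz, dtype=float), 1.0)  # homogeneous point
    return (T_robot_cam @ p)[:3]

# Example: a camera offset 0.5 m along x and 0.2 m along z from the base,
# with no rotation (purely illustrative numbers).
T = np.eye(4)
T[:3, 3] = [0.5, 0.0, 0.2]
print(camera_to_robot([0.1, 0.2, 0.3], T))  # → [0.6 0.2 0.5]
```

In practice this transform would come from the calibration routine [31] and be broadcast through ROS's tf tree rather than passed around by hand; the grasp orientation would be transformed the same way with the rotation part of the matrix.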

C. Model architecture

Fig. 2 shows the proposed GR-ConvNet model, a generative architecture that takes in an n-channel input image and generates pixel-wise grasps in the form of three images. The n-channel image is passed through three convolutional layers, followed by five residual layers and convolution transpose layers, to generate four output images: the grasp quality score, the required angle in the form of cos 2Θ and sin 2Θ, and the required width of the end effector. Since the antipodal grasp is uniform around ±π/2, we extract the angle in the form of two elements, cos 2Θ and sin 2Θ, whose distinct output values are combined to form the required angle.

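Recombining the two angle maps comes down to a half `atan2`: since the network outputs cos 2Θ and sin 2Θ, halving the recovered 2Θ keeps Θ in (-π/2, π/2], matching the grasp's symmetry. A minimal sketch:

```python
import numpy as np

def decode_angle(cos_2t, sin_2t):
    """Recover the grasp angle from the cos(2Θ) and sin(2Θ) output maps.

    arctan2 returns 2Θ in (-π, π], so halving it keeps Θ in (-π/2, π/2],
    matching the ±π/2 symmetry of an antipodal grasp.
    """
    return 0.5 * np.arctan2(sin_2t, cos_2t)

theta = 0.7
print(decode_angle(np.cos(2 * theta), np.sin(2 * theta)))  # → 0.7 (approximately)
```

This also works elementwise on the full 224×224 maps, and angles that differ by π decode to the same grasp, which is exactly the desired behaviour for a two-fingered gripper.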

Fig. 2: The proposed Generative Residual Convolutional Neural Network.
