Reinforcement Learning: "Grasping in the Wild: Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations"

Article link: https://blog.csdn.net/cocapop/article/details/136627781

Our goal is to achieve reliable 6DoF closed-loop grasping within a framework flexible enough to handle novel objects and dynamic scene configurations with moving objects. We find that this is achievable by training a visual grasping value function on a large dataset of human demonstrations, collected with a handheld gripper equipped with a wrist-mounted camera and augmented using view-based rendering. Section IV describes our hardware setup and data collection process for gathering human grasping demonstrations across a variety of tasks and environments (i.e., in the wild). Section V describes our 6DoF closed-loop grasping model and how it is trained with this data.

The task of closed-loop grasping requires an action policy that enables the robot to move its gripper towards an object and approach it from an angle that is likely to lead to a stable grasp. This pre-grasp approaching process is a time-varying sequence of actions, for which rewards are loosely defined, and has previously been shown to be more effectively learned through reinforcement than from direct supervision [3], [19].

We formulate this vision-based grasping problem as a Markov decision process: given state $s_t$ at time $t$, the robot chooses and executes an action $a_t$ according to a policy $\pi(s_t)$, then transitions to a new state $s_{t+1}$ and receives a reward $r_t$. The goal of reinforcement learning is to find an optimal policy $\pi^*$ that selects actions which maximize the total expected rewards $Q(s_t, a_t) = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$, i.e., the $\gamma$-discounted sum over an infinite horizon of future returns from time $t$ to $\infty$. In this work, we use off-policy Q-learning to learn the optimal parameterized Q-function $Q_\theta(s_t, a_t)$ (i.e., state-action value function), where $\theta$ might denote weights of a neural network. Formally, our learning objective is to iteratively minimize the temporal difference error $\delta_t$ between $Q_\theta(s_t, a_t)$ and a target value $y_t$:

$$\delta_t = \left| Q_\theta(s_t, a_t) - y_t \right|$$

$$y_t = r_t + \gamma \, Q_\theta\Big(s_{t+1}, \arg\max_{a_{t+1} \in \mathcal{A}_{t+1}} Q_\theta(s_{t+1}, a_{t+1})\Big)$$

where $\mathcal{A}_t$ is the set of all available actions at time $t$.

Within our formulation, we represent each state $s_t$ as an image observation from the wrist-mounted camera. We parameterize each action $a_t$ as a 6DoF rigid transform that encodes the relative rotation and translation from the current robot end effector pose to the next target pose. Motion planning between end effector poses is autonomously executed on the real robot using standard proportional-derivative (PD) control with inverse kinematics (IK) solvers.
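To make the action parameterization concrete, here is a minimal sketch of how a relative 6DoF action could be composed with the current end effector pose to get the next target pose. It assumes poses are 4x4 homogeneous matrices and that the action is expressed in the current gripper frame; the excerpt does not state either convention. The function names and the numbers are hypothetical.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation about the z axis, as nested lists."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def make_transform(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 homogeneous matrix."""
    rows = [rotation[i] + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]

def compose(a, b):
    """Matrix product a @ b for 4x4 nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply_action(current_pose, action):
    """Next target pose. Assumption: the action is relative to the current gripper frame."""
    return compose(current_pose, action)

# Hypothetical example: gripper 0.5 m above the origin, no rotation.
current = make_transform(rotation_z(0.0), [0.0, 0.0, 0.5])
# Hypothetical action: rotate 90 degrees about z, shift 2 cm along x and 5 cm down along z.
action = make_transform(rotation_z(math.pi / 2), [0.02, 0.0, -0.05])
target = apply_action(current, action)
print([round(row[3], 3) for row in target[:3]])  # translation part of the target pose
```

On a real robot the target pose would then be handed to the IK solver and PD controller mentioned above; that part is not sketched here.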
The algorithm outputs a gripper closing signal by using depth observations from the camera to measure proximity to objects. The algorithm checks the local region of depth values between fingertips, and issues a close command if the nearest 1% of depth in this area is smaller than $d_{\mathrm{close}} = \text{depth of fingertips} - 0.015\,\mathrm{m}$. After the gripper attempts to close, the system lifts the gripper up 0.1 m and checks the finger width to determine grasp success. Each grasping trajectory begins with the end effector initially positioned 50 cm away overlooking the scene of objects, and terminates after 40 state transitions or after a successful grasp. Rewards are provided as $r = 1$ for successful grasps and $r = 0$ otherwise.
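The closing rule described above could look roughly like the sketch below. It is one possible reading, not the paper's code: it takes the nearest 1% of the depth readings between the fingertips and fires when all of them are closer than the fingertip depth minus 0.015 m. The function name, the flat list input, and the treatment of zero readings as missing data are assumptions.

```python
def should_close(depths_between_fingertips, fingertip_depth,
                 margin=0.015, fraction=0.01):
    """Return True when the nearest `fraction` of readings lies closer than d_close.

    depths_between_fingertips: depth values in meters from the region between
    the fingertips. Assumption: a value of 0 encodes a missing reading.
    """
    valid = sorted(d for d in depths_between_fingertips if d > 0.0)
    if not valid:
        return False  # nothing measured, so do not close
    count = max(1, int(len(valid) * fraction))  # size of the nearest 1%
    nearest = valid[:count]
    d_close = fingertip_depth - margin
    return max(nearest) < d_close

# Hypothetical region of 200 readings: background at 0.30 m, two pixels at 0.10 m.
region = [0.30] * 198 + [0.10] * 2
print(should_close(region, fingertip_depth=0.12))
```

Requiring the whole nearest 1% to pass the threshold, rather than just the single closest pixel, makes the check less sensitive to isolated depth noise. Whether the original system uses exactly this criterion is not stated in the excerpt.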

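The grasping method presented in this paper is not end-to-end grasping but a vision-guided grasping system based on reinforcement learning. The main components of the system and how they are implemented are as follows: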

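Problem background: The task in the paper is closed-loop grasping: the robot must execute a series of actions to approach an object from a suitable angle and grasp it stably. This process is a time-varying sequence of actions whose rewards are only loosely defined, and the pre-grasp approach is better learned through reinforcement learning than through direct supervision.
Problem formulation: The vision-guided grasping problem is modeled as a Markov decision process (MDP). At each time step t, the robot chooses and executes an action a according to a policy π(s), then transitions to a new state s' and receives a reward r.
Reinforcement learning: The problem is solved with reinforcement learning, where the goal is to learn an optimal policy π* that selects the actions maximizing the total expected reward. The paper uses off-policy Q-learning to learn a parameterized Q-function Qθ(s, a), where θ denotes the neural network weights. The Q-function is trained by iteratively minimizing the temporal-difference error δ.
State and action representation: Each state s is represented as an image observation from the wrist-mounted camera. Each action a is parameterized as a 6DoF rigid transform that encodes the relative rotation and translation from the current end effector pose to the next target pose. Motion between end effector poses is executed autonomously using standard proportional-derivative (PD) control and an inverse kinematics solver.
Grasping strategy: The system issues a gripper-close signal based on depth observations that measure proximity to objects. After the gripper attempts to close, the system lifts the gripper 0.1 m and checks the finger width to determine whether the grasp succeeded. Each grasping trajectory starts with the end effector positioned 50 cm away, overlooking the objects in the scene, and terminates after 40 state transitions or after a successful grasp. The reward is r=1 for a successful grasp and r=0 otherwise.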
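The temporal-difference objective summarized in the list above can be illustrated with a small sketch. It assumes a finite set of candidate actions at each step and a Q-function passed in as a plain Python callable. The discount value, the toy Q-table, and the rule that a terminal step uses the reward alone as its target are illustrative assumptions, not details taken from the paper.

```python
def td_error(q, state, action, reward, next_state, next_actions,
             gamma=0.9, terminal=False):
    """Return (delta, target) for one transition.

    q: callable q(state, action) -> float, standing in for the learned Q-function.
    next_actions: candidate actions available in next_state.
    Assumption: on a terminal step the target is just the reward.
    """
    if terminal or not next_actions:
        target = reward
    else:
        # Pick the action with the highest Q-value in the next state, then evaluate it.
        best_next = max(next_actions, key=lambda a: q(next_state, a))
        target = reward + gamma * q(next_state, best_next)
    return abs(q(state, action) - target), target

# Hypothetical toy Q-table with symbolic states and actions.
q_table = {("s0", "forward"): 0.2, ("s1", "forward"): 0.5, ("s1", "rotate"): 0.8}

def q(state, action):
    return q_table.get((state, action), 0.0)

delta, y = td_error(q, "s0", "forward", 0.0, "s1", ["forward", "rotate"], gamma=0.5)
print(delta, y)
```

In training, this error would be the loss minimized with respect to the network weights θ. With the sparse reward described above (1 only on a successful grasp), the value of earlier approach steps comes entirely from the bootstrapped term.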

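This system therefore uses reinforcement learning, learning a Q-function to guide the robot through a sequence of grasping actions. The overall system runs in stages rather than learning end-to-end directly from raw input to final output.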

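Reinforcement learning is a branch of machine learning. It can be combined with deep learning and transfer learning, but it is not limited to either. Brief explanations of these terms follow: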

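Machine learning (ML): Machine learning is a set of techniques that let computer systems improve automatically through experience. It covers many methods and algorithms, including but not limited to supervised learning, unsupervised learning, and reinforcement learning. Its goal is to give computer systems the ability to extract patterns and regularities from data.
Reinforcement learning (RL): Reinforcement learning is a class of machine learning in which an agent learns, by interacting with an environment, how to act so as to maximize a reward signal. It suits problems learned by trial and error: the agent receives rewards or penalties based on the outcomes of its actions and adjusts its policy to improve its behavior.
Deep learning (DL): Deep learning is a branch of machine learning that learns with artificial neural networks. Its key idea is to use deep networks to learn feature representations at multiple levels of abstraction, which is very effective for complex patterns and large datasets. Deep learning is commonly combined with supervised, unsupervised, and reinforcement learning.
Transfer learning: Transfer learning is a machine learning paradigm in which knowledge learned on one task or domain is transferred to another task or domain to improve learning performance. It can be combined with deep learning and reinforcement learning to reuse features or knowledge learned on one task to improve performance on a related task.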

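Reinforcement learning is therefore a subset of machine learning, and it can be combined with techniques such as deep learning and transfer learning depending on the problem and the nature of the task.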