[Paper Notes] Towards Corrective Deep Imitation Learning in Data Intensive Environments: Helping robots to learn faster by leveraging human knowledge

Abstract

Interactive imitation learning refers to learning methods where a human teacher interacts with an agent during the learning process, providing feedback to improve its behavior.

Problem raised: However, this (the experience replay used in deep reinforcement learning) causes conflicts between the data in the buffer, because samples collected by older versions of the policy may be contradictory and could deteriorate the performance of the current policy.

  1. Contribution 1: The present thesis focuses on interactive learning with corrective feedback and, in particular, on the framework Deep Corrective Advice Communicated by Humans (D-COACH), which has been shown to be advantageous in terms of training time and data efficiency.

  2. Contribution 2: The current implementation of D-COACH uses a first-in-first-out buffer with limited size, since the older a sample is, the more likely it is to deteriorate the performance of the learner (the replay buffer is a FIFO queue, which limits the influence of samples collected by old policies on the new policy).

    This approach faces a trade-off: on the one hand, data conflicts and complexity must be reduced; on the other hand, forgetting must be avoided.

  3. Contribution 3: We propose an improved version of D-COACH, which we call Batch Deep COACH (BD-COACH, pronounced "be the coach"). BD-COACH incorporates a human model module that learns the feedback from the teacher and that can be employed to make corrections gathered by older versions of the policy still useful for batch-updating the current version of the policy.

  4. Contribution 4: Both simulated and real-robot experiments are presented.

1 Introduction

However, these examples (instances of deep reinforcement learning) tend to occur in simulated environments with very specific learning tasks.

Furthermore, for many real problems it is easier to demonstrate the task than to design a reward function for applying reinforcement learning.

Behavioral cloning has two main drawbacks. First, it requires demonstrations from an expert teacher, which limits who can train the agent. Second, it suffers from covariate shift, a distribution-mismatch problem that starts the moment the agent deviates from the expert trajectory, causing a cascade of errors that will probably make the agent fail the task.


Interactive imitation learning (IIL) is a branch of imitation learning that deals with the aforementioned issues by allowing a teacher to help the agent learn during its training.

In this work, we focus on corrections, which gives name to the branch of IIL called corrective imitation learning (CIL). In CIL frameworks, the human teacher sends corrections informing the agent whether the value of a taken action should be increased or decreased.


The goal of this master thesis is to create an extension of Deep Corrective Advice Communicated by Humans (D-COACH), a CIL algorithm designed for non-expert humans that uses an artificial neural network as a function approximator for its policy (i.e., D-COACH does not require an expert teacher).

ER (Experience Replay) endows algorithms with two main advantages: higher data efficiency and the ability to train with uncorrelated data.

Collected past experiences can be reused multiple times, and the ANN becomes more robust against locally over-fitting to the most recent trajectories, a phenomenon known as catastrophic forgetting. Note that we refer to ER as corrections replay since, in this work, we replay old corrective feedback.


This forces the buffer size to be limited, since the buffer works under the assumption that the data stored in it is still valid for the current version of the policy, even if it was collected by an older version of the policy.

As the size of the replay buffer increases, this assumption no longer holds and the training of the policy will most likely fail, which therefore limits the types of problems that D-COACH can address.

BD-COACH incorporates a human model module that learns the feedback from the teacher and that can be employed to make corrections gathered by older versions of the policy still useful for batch-updating the current version of the policy.


My understanding: D-COACH is a corrective imitation learning method that does not require expert knowledge, which makes it quite practical for imitation learning, so this work builds on D-COACH. However, D-COACH keeps a fixed-size replay buffer, while the accumulated experience keeps growing as sequences of actions are generated (hence "data intensive" in the title). A growing replay buffer degrades D-COACH, because D-COACH relies on the assumption that experience from old policies remains consistent with the new policy; this is why the authors propose BD-COACH.

2 Background and Related Work

This chapter mainly reviews reinforcement learning and related work.

2.1. Reinforcement Learning

2.1.1. On-policy and Off-policy Reinforcement Learning

According to Sutton and Barto, off-policy methods use two policies: They evaluate or improve a policy, the target policy, while using a different policy to generate the data, the behavior policy.

On the other hand, on-policy methods use a single policy: the policy that is evaluated or improved is the same one used to generate behavior.

2.1.2. On-line and Off-line Reinforcement Learning

Traditional RL algorithms are on-line frameworks where the agent iteratively interacts with its environment, collecting experience to update its policy. The on-line approach works well in simulated environments; however, for real-world settings, on-line learning is impractical because the agent still needs to collect a large and diverse dataset.

Offline reinforcement learning addresses the aforementioned problem. The key idea is that, using only previously collected data, the agent has to learn the best possible policy without additional online data collection. With this offline framework, it is possible to apply RL to real-world domains like robotics, where the agent (the robot) could easily get damaged while collecting data iteratively in an online manner.

2.1.3. Experience Replay

Experience replay provides several benefits.

  1. First, it is an efficient way of taking advantage of previously collected experience by replaying it multiple times.
  2. Furthermore, experience replay provides uncorrelated data to train the neural network, which helps it to generalize and to minimize overfitting to the most recent trajectories.

It is important to remark that even if the experiences were collected with a single policy $\pi$, because the policy evolves over time, the policy at time step $t$, $\pi_{t}$, is not equal to that same policy at a later time step $N$, $\pi_{N}$. They are considered experiences gathered by different policies, and therefore only off-policy methods are applicable.
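
As a concrete illustration, here is a minimal FIFO replay-buffer sketch in Python; the class and method names are hypothetical and only illustrate the mechanism described above, not the thesis code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO experience replay: once full, the oldest samples are discarded."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)  # appending beyond max_size drops the oldest entry

    def add(self, experience):
        self.buffer.append(experience)        # e.g. a (state, action, next_state) tuple

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions before a network update
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```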

2.2. Function Approximation with Artificial Neural Networks

In particular, for this thesis, we will focus on fully connected feedforward neural networks (FNN) where information flows in only one direction without going through any loop.

2.3. Imitation Learning

Imitation learning is more useful than reinforcement learning when it is easier for a teacher to demonstrate the task or provide feedback than to specify a reward function that would lead to the desired behavior.

2.3.1. Interactive Imitation Learning

Interactive imitation learning (IIL) is a branch of imitation learning where human teachers can help intelligent agents learn during their training. The teacher's involvement can take several forms:

  1. Demonstrations: the human demonstrates the task when the robot requests it.
  2. Evaluative feedback in the form of scalar values: the human is presented with several executions of a policy and has to decide which one is better according to the goal of the task. Then a reward function that explains the human's decisions is found, and by applying RL the agent learns how to perform the task.
  3. Corrective feedback: corrective imitation learning improves on the informativeness of evaluative feedback by allowing the teacher to inform the agent whether the value of a taken action should be increased or decreased, and it requires less exploration compared to evaluative feedback.

2.3.2. On-Policy and Off-Policy Imitation Learning

  • In off-policy imitation learning, an agent observes demonstrations from a supervisor and tries to recover the behavior via supervised learning; an example of off-policy IL is behavioral cloning.
  • On-policy imitation learning methods sample trajectories from the agent's current distribution and update the model based on the data received. A common on-policy algorithm is DAgger.

About DAgger:

But even when referring to the simple version of DAgger, there is another reason to consider it off-policy: by its very nature, DAgger learns from a buffer; in other words, it learns from information gathered by older versions of the policy that are different from the current version.


2.3.3. Online and Offline Imitation Learning

In offline imitation learning, the agent learns by imitating a demonstrator without additional online environment interactions, unlike in the case of online IL.

2.4 Corrective Imitation Learning

The new framework that we propose in Chapter 3 is based on the D-COACH algorithm which, in turn, derives from the COACH algorithm; both methods are presented next.

2.4.1. COACH: Corrective Advice Communicated by Humans

The method Corrective Advice Communicated by Humans (COACH) [13] is a CIL framework designed for non-expert human teachers, where the person supervising the learning agent provides occasional corrections when the agent behaves wrongly.

Key features: (1) it does not require expert teachers; (2) it does not require continuous demonstrations, since corrections are only given when the robot makes a mistake.

This corrective feedback $h$ is a binary signal that indicates the direction in which the executed action $a = \pi_{\theta}(s)$ should be modified.

  1. The policy parameters $\theta$ are updated using a stochastic gradient descent (SGD) strategy in a supervised learning manner.
  2. $J(\theta)$ is the mean squared error between the applied and the desired action.
  3. $\theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta)$

COACH works under the assumption that teachers are non-experts and that, therefore, they are only able to provide a correction trend that tells the sign of the policy error but not the error itself. In other words, COACH tells the agent that its action was wrong and in which direction to change it, but not by how much.


To compute the exact magnitude of the error, COACH incorporates a hyperparameter $e$ that needs to be defined beforehand, resulting in $error_{t} = h_{t} \cdot e$.

The error needs to be defined as a function of the parameters in order to compute the gradient in the parameter space of the policy.

Thus, the error can also be described as the difference between the desired action generated with the teacher's feedback, $a_{t}^{target} = a_{t} + error_{t}$, and the current output of the policy, $a_{t} = \pi_{\theta}(o_{t})$:

$$error_{t} = a_{t}^{target} - a_{t} = a_{t}^{target} - \pi_{\theta}(o_{t})$$
$$\theta \leftarrow \theta + \alpha \cdot error_{t} \nabla_{\theta}\pi_{\theta}$$
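
To make the update rule concrete, here is a minimal sketch of a single COACH correction step for a linear policy in NumPy; the function and variable names are hypothetical and only illustrate the formulas above, not the thesis implementation:

```python
import numpy as np

def coach_update(theta, obs, h, e, alpha):
    """One COACH correction step for a linear policy a = theta @ obs (1-D action).

    theta : policy parameters, shape (obs_dim,)
    obs   : current observation o_t, shape (obs_dim,)
    h     : human feedback in {-1, +1} (sign of the desired change)
    e     : error-magnitude hyperparameter
    alpha : learning rate
    """
    action = theta @ obs                      # a_t = pi_theta(o_t)
    error = h * e                             # error_t = h_t * e
    a_target = action + error                 # desired action implied by the correction
    grad_pi = obs                             # gradient of a linear policy w.r.t. theta
    return theta + alpha * (a_target - action) * grad_pi  # theta <- theta + alpha * error_t * grad
```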

2.4.2. D-COACH: Deep COACH

Deep COACH, D-COACH, is the "deep" version of the COACH algorithm in the sense that it uses an artificial neural network to represent the policy of the agent.

The current version of D-COACH implements the corrections replay technique to be more data efficient. During learning, tuples of old corrections, $(s_{t}, a_{t}^{target})$, are stored in a memory buffer $B$ and then replayed to update the current policy of the agent.
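
A minimal sketch of what such a corrections-replay update could look like with a PyTorch policy; the interface (a buffer of $(s_t, a_t^{target})$ pairs, a supervised MSE step) is an assumption made for illustration rather than the thesis code:

```python
import random
import torch

def replay_corrections(policy, optimizer, buffer, batch_size):
    """Replay stored corrections (s_t, a_target_t) to update the current policy.

    policy : torch.nn.Module mapping a batch of states to a batch of actions
    buffer : list of (state, a_target) pairs from past human corrections
    """
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    states = torch.tensor([s for s, _ in batch], dtype=torch.float32)
    targets = torch.tensor([a for _, a in batch], dtype=torch.float32)

    # Supervised step: push pi_theta(s) towards the stored corrected actions
    loss = torch.nn.functional.mse_loss(policy(states), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```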

However, the way that D-COACH implements the replaying of corrections has limitations. The replay buffer $B$ works under the assumption that recent feedback is still valid for updating the most recent version of the policy. Due to this assumption, the size of the buffer that D-COACH implements needs to be drastically reduced; otherwise, old corrections could update the policy in undesired directions of the policy's parameter space.

On the other hand, a very small replay buffer will cause the policy to overfit to data generated in the most recent trajectories, which limits the current version of D-COACH to problems with low data intensity.


3 Batch Deep COACH (BD-COACH)

The current version of the CIL algorithm D-COACH is limited to problems that do not require large amounts of data, as its replay buffer needs to be kept small.

3.1. Difference between D-COACH and BD-COACH

Therefore, a task that combines a complex, high-dimensional observation space with a long horizon and a complex observation-to-action mapping would be very challenging for D-COACH to learn.

In D-COACH, batch updates are independent of the policy, since corrections from the buffer do not depend on what the policy is currently doing at those particular states. As a consequence, feedback gathered by older versions of the policy can deteriorate the performance of the current policy.

The human observes the state and the action of the agent at a particular moment and gives a correction accordingly. We introduce a human teacher learner module in our framework as an artificial neural network that takes state-action pairs as input and outputs the appropriate corrective feedback. This module, called the human model, is learned in parallel with the policy.

BD-COACH is able to handle more data-demanding tasks thanks to the human model module, which learns to predict the feedback that the teacher provides. These predicted corrections depend on the output of the policy at a particular state, which makes them suitable for updating the current version of the policy.

3.2. Learning Framework


Policy

Takes the state of the environment as input and outputs the corresponding action.

Replay Buffer

When the human teacher provides a correction, a tuple $(s_{t}, a_{t}, h_{t})$ is stored in the replay buffer.

Policy Update Module

The corrections that are fed to the policy update module depend on the actions taken by the policy for those states.

These corrections do not come directly from the replay buffer, as in the case of D-COACH, but are instead the output of the human model module.

Human Model

BD-COACH incorporates a human model, $H(s, a)$, that learns to predict the corrective feedback given by the human teacher for inputs of state-action pairs.

The framework Gaussian Process Coach (GPC) also employs a human model that, like ours, takes actions into account in addition to states. The difference is that GPC implements Gaussian processes as function approximators for both its policy and its human model, in order to estimate the uncertainty of states and actions. In the case of BD-COACH, the human model is an artificial neural network that generates labels that are useful for the current version of the policy.

During the batch update of the policy, a mini-batch of states uniformly sampled from the replay buffer is passed to both the policy and the human model as input.

To clarify, these mini-batches of states are different from normal mini-batches: for this step, we do not use the actions or corrections stored in the buffer.

Human Model Update

The human model update module is in charge of updating the weights of the ANN that represents the human model, $H(s, a)$. The human model is updated with tuples $(s_{t}, a_{t}, h_{t})$ stored in the replay buffer, thereby applying the corrections replay technique.
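
Putting the modules above together, here is a minimal sketch of the two BD-COACH batch updates (human model and policy). The interfaces are assumptions made for illustration: `policy` and `human_model` are PyTorch modules, `human_model(s, a)` returns the predicted correction $h$, `buffer` stores $(s_t, a_t, h_t)$ tuples, and `e` is the error-magnitude hyperparameter inherited from COACH:

```python
import random
import torch

def update_human_model(human_model, h_opt, buffer, batch_size):
    """Fit H(s, a) -> h on stored correction tuples (s_t, a_t, h_t): corrections replay."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch], dtype=torch.float32)
    h = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    loss = torch.nn.functional.mse_loss(human_model(s, a), h)
    h_opt.zero_grad()
    loss.backward()
    h_opt.step()

def update_policy(policy, human_model, pi_opt, buffer, batch_size, e):
    """Batch-update the policy with corrections predicted for its *current* actions."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)  # only the states are reused
    a = policy(s)                            # what the current policy does now at those states
    h_pred = human_model(s, a)               # predicted teacher feedback for those actions
    a_target = (a + e * h_pred).detach()     # desired actions implied by the predicted feedback
    loss = torch.nn.functional.mse_loss(a, a_target)
    pi_opt.zero_grad()
    loss.backward()
    pi_opt.step()
```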

3.3. Discussion

4 Experimental Setting

4.1. Meta-World Benchmark

All the tasks in Meta-World need an agent that executes an action in the environment of the form $[\delta x, \delta y, \delta z, g]$. The first three dimensions of the action correspond to the change in position of the end effector along the three Cartesian axes. The last dimension represents the gripper effort that keeps the fingers of the end effector open or closed. In our case, for this dimension, the expert policy always commands a constant value, keeping the gripper open or closed depending on the task. The observation space is a 9-dimensional space formed by the 3D Cartesian positions of the end effector, the object, and the goal.

This metric, $\lVert object - goal \rVert_{2} < \epsilon$, is based on the Euclidean distance between the object position and the goal position, where $\epsilon$ is a small distance threshold that varies from task to task.
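
As a quick illustration, the success check boils down to a single distance comparison (a sketch; the argument names are hypothetical and `epsilon` is the per-task threshold):

```python
import numpy as np

def is_success(object_pos, goal_pos, epsilon):
    """Episode counts as successful once the object is within epsilon of the goal."""
    return np.linalg.norm(np.asarray(object_pos) - np.asarray(goal_pos)) < epsilon
```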

4.1.1. Simulated Experiments

The goal of the simulated experiments is to compare the performance between BD-COACH and D-COACH as a function of the amount of data required to solve a task.

The reason behind this design choice is that states formed by relative positions, $s = [[xyz_{end\_effector}], [xyz_{object}], [xyz_{goal}]]$, make it easier for the robot to generalize as the number of dimensions decreases.

4.1.2. Task plate-slide-v2

If the puck goes inside the goal in fewer than 500 time steps, the episode is considered successful, and a fail otherwise. The task starts with the gripper and the object always initialized at the same position, whereas the goal is initialized randomly within an area of $0.01\,m^{2}$.


4.1.3. Task drawer-open-v2


4.1.4. Task button-press-top-down-v2


4.1.5. Synthesized Feedback

Using an oracle removes human factors such as the human teacher providing inconsistent feedback or getting tired, which would make comparisons between algorithms unfair. Furthermore, in order to compare the performances of different frameworks, it is necessary to run many simulations to obtain a good average of the results, which would be completely impractical if the teacher were a real human.

In short: with a real human teacher, providing guidance and corrections over a long time leads to fatigue, so later corrections become weaker. The feedback signal is therefore synthesized and delegated to an "oracle".

The oracle used in this work generates feedback by computing $h = sign(a_{teacher} - a_{agent})$, whereas the decision on whether to provide feedback at each time step is given by the probability $P_{h} = \alpha \cdot \exp(-\tau \cdot timestep)$, where $\alpha \in \mathbb{R},\ 0 \le \alpha \le 1$ and $\tau \in \mathbb{R},\ 0 \le \tau$. In other words, whether a correction is sent at each time step follows an exponentially decaying probability, and the correction signal itself is computed with the sign formula.

Furthermore, this binary feedback $h$ is only provided if the difference between the action of the policy and the action of the teacher is larger than a threshold $\epsilon$.

That is, a correction is only issued when the difference exceeds the threshold; otherwise no feedback is given.
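
A minimal sketch of such an oracle, assuming a per-dimension threshold and the parameters $\alpha$, $\tau$, $\epsilon$ defined above (names are hypothetical, not the thesis code):

```python
import numpy as np

def oracle_feedback(a_teacher, a_agent, timestep, alpha, tau, epsilon, rng=np.random):
    """Synthesized corrective feedback: h = sign(a_teacher - a_agent), sent with
    probability P_h = alpha * exp(-tau * timestep) and only for action dimensions
    whose error exceeds the threshold epsilon. Returns None when no feedback is given."""
    if rng.random() > alpha * np.exp(-tau * timestep):
        return None                              # the oracle decides not to correct this step
    diff = np.asarray(a_teacher) - np.asarray(a_agent)
    h = np.sign(diff)
    h[np.abs(diff) <= epsilon] = 0.0             # no correction where the policy is already close
    return h if np.any(h) else None
```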

4.2. Experiments with KUKA Robot Arm

In order to validate the newly proposed method BD-COACH on a real robotic setup with real human teachers, we devised two tasks involving a KUKA LBR iiwa 7 robot arm pushing a box placed on top of a table.

Several reflective markers were attached to the box so its pose could be tracked by an OptiTrack motion capture system. The pose, captured by the eight cameras of the available OptiTrack system, consists of the position and orientation of the central point of the box defined by the reflective markers. (Markers are attached to the object so that its trajectory can be tracked with OptiTrack.)

The human that supervises the learning process conveys the corrections with a joystick.

4.2.1. Task KUKA-push-box

Pushing a box in a straight line without a reactive robot is simply impossible, as the box will naturally drift away from the desired straight trajectory. (The robot's motion alone is not precise enough to push the object in a straight line.)

Figure 4.4 shows this problem: a constant velocity is commanded to the end effector but, because the robot does not react to the misalignments, the box keeps deviating from the desired straight trajectory. (The KUKA robot fails to push the box in a straight line and drifts off course.)


4.2.2. Task KUKA-park-box


5 Results

5.1. Results of Simulated Tasks

This fact indicates that, on certain time steps, the difference between the agent's actions and the oracle's actions is smaller than the oracle's threshold $\epsilon$, and therefore no feedback is required for those time steps.


5.2. Results of Validation in Real System


6 Conclusion

  1. More exhaustive experiments.

    BD-COACH has been successfully shown in simulation to benefit from the corrections replay technique.

    However, it would be necessary to run exhaustive experiments with more human participants to take into account human factors that we have not considered and to truly demonstrate the benefits of BD-COACH.

  2. Use images as observations.

    BD-COACH has been validated only with observations formed by Cartesian positions.

    It would be very interesting to see whether it is able to maintain its performance when the policy is fed with observations formed by images.

  3. Validation with longer horizon tasks.

    Finally, more complex tasks could be taught to BD-COACH to see to what extent it can get the most out of the replay buffer.
