Introduction
The goal of this work is to enable a robotic generalist to learn from only a handful of demonstrations, which may even be raw videos. This problem setting places us squarely in one-shot (few-data) visual imitation learning: learning a policy directly from demonstrations.
The paper mentions two major challenges of imitation learning: 1) compounding errors (which this work does not address), and 2) the need for plenty of data. Prior efforts to reduce the data requirement resort to inverse RL, which can infer a reward function from a few demos. Here the authors instead start from meta-learning, which compensates for the lack of data by reusing experience from similar (more precisely, transferable) tasks.
Under the meta-learning framing, the problem is re-formulated as: how do we learn a policy (from meta-train tasks) that can quickly adapt to new tasks (meta-test tasks)? Quick adaptation is required precisely because we have only very few samples (demos, of course) of each new task.
MAML for imitation learning
Suppose we have many tasks drawn from a task distribution $P(\mathcal{T})$, where each task is defined by a triplet (expert demonstrations, loss function, task description) that follows the imitation learning setting.
The meta-train dataset is composed of sampled tasks used for meta-learning; the meta-test dataset is used to evaluate how well the model adapts to new tasks.
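To make this setup concrete, here is a minimal sketch of how a task and the meta-train/meta-test split might be represented. The field names (`demos`, `loss_fn`, `description`) and the helper `split_tasks` are my own placeholders for illustration, not the paper's code:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import random

@dataclass
class Task:
    """One imitation task: a few expert demos plus a loss for behavior cloning."""
    demos: List[dict]      # each demo, e.g. {"observations": ..., "actions": ...}
    loss_fn: Callable      # e.g. squared error between predicted and expert actions
    description: str       # task description (or a raw video at meta-test time)

def split_tasks(all_tasks: Sequence[Task], n_test: int, seed: int = 0):
    """Hold out n_test tasks as the meta-test set; the rest form the meta-train set."""
    tasks = list(all_tasks)
    random.Random(seed).shuffle(tasks)
    return tasks[n_test:], tasks[:n_test]   # (meta_train, meta_test)
```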
MAML (Model-Agnostic Meta-Learning) is straightforward if I interpret it as finding policy parameters that achieve maximal overall performance after an ordinary gradient update on each meta-train task, using that task's own loss function.
This follows exactly the original intent of fast adaptation to new tasks: the objective function itself is an expectation of post-adaptation performance across meta-train tasks, and updating the parameters with its gradients means this objective is directly being pursued.
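Written out in the MAML paper's notation (inner step size $\alpha$, task loss $\mathcal{L}_{\mathcal{T}_i}$, policy $f_\theta$), the meta-objective is:

$$
\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad
\min_\theta \sum_{\mathcal{T}_i \sim P(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta_i'}\bigr)
= \min_\theta \sum_{\mathcal{T}_i \sim P(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\bigr)
$$

The inner gradient step is the "normal gradient update for each meta-train task" mentioned above, and the outer minimization over $\theta$ is the search for the shared initialization.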
To break this approach apart more precisely (and clearly), there are actually three steps: 1) pretend we perform fast adaptation to some unknown tasks, 2) sum the performance of these imaginary fast-adapted solutions, 3) optimize this overall performance to find the (optimal) starting point from which to begin fast adaptation. A toy sketch of the three steps follows below.
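Here is a minimal, self-contained sketch of those three steps in Python with JAX, using a toy linear policy and a squared-error imitation loss of my own choosing; names such as `inner_lr`, `task_loss`, `adapt`, and `meta_step` are placeholders for illustration, not the paper's actual vision network or code:

```python
import jax
import jax.numpy as jnp

inner_lr, outer_lr = 0.1, 0.01   # assumed step sizes, for illustration only

def task_loss(theta, batch):
    """Toy imitation loss: squared error of a linear policy against expert actions."""
    obs, expert_actions = batch
    return jnp.mean((obs @ theta - expert_actions) ** 2)

def adapt(theta, support_batch):
    """Step 1: 'imaginary' fast adaptation -- one inner gradient step on a task's demo."""
    grads = jax.grad(task_loss)(theta, support_batch)
    return theta - inner_lr * grads

def meta_loss(theta, tasks):
    """Step 2: sum the post-adaptation performance over the sampled meta-train tasks."""
    losses = [task_loss(adapt(theta, support), query) for support, query in tasks]
    return sum(losses)

def meta_step(theta, tasks):
    """Step 3: optimize the summed post-adaptation loss w.r.t. the starting point theta."""
    grads = jax.grad(meta_loss)(theta, tasks)
    return theta - outer_lr * grads

# Tiny usage example: two tasks, each with a support demo (inner) and a query demo (outer).
key = jax.random.PRNGKey(0)
theta = jnp.zeros(3)
tasks = []
for _ in range(2):
    k1, k2, key = jax.random.split(key, 3)
    obs = jax.random.normal(k1, (5, 3))
    w_true = jax.random.normal(k2, (3,))
    tasks.append(((obs, obs @ w_true), (obs, obs @ w_true)))
theta = meta_step(theta, tasks)
```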
Notice there are two gradient update steps in the approach: one for the inner loss and another for the outer. Since this is what the meta-learning process consists of, at least two samples are needed per update: one demonstration drives the inner adaptation step, and a separate one evaluates the adapted policy in the outer loss. So in the meta-train dataset, each task should come with at least two demonstrations.
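In code terms (continuing the toy sketch above), the per-task demo split might look like this; `make_support_query` is a hypothetical helper, not something from the paper:

```python
def make_support_query(demos):
    """Each meta-train task needs >= 2 demos: one for the inner adaptation step,
    and a held-out one to score the adapted policy in the outer loss."""
    assert len(demos) >= 2, "each meta-train task needs at least two demonstrations"
    support, query = demos[0], demos[1]
    return support, query
```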
Besides, although the original problem setting claims to take only raw videos as input, this really refers to the testing stage (meta-test); it is not true of the meta-train dataset. I think it is safer to let the robot at least experience what the correct actions should be during meta-training, rather than somehow extracting them from raw videos, which could well introduce errors.
A messy point of MAML is that we estimate the expected performance of fast adaptation on the basis of the initial parameters: the outer objective only measures how well a single gradient step away from the shared initialization performs, which is a rather local proxy for adaptability.