Introduction
The goal of this work is to enable a robotic generalist to learn from only a handful of demonstrations, which may even be raw videos. This problem setting places us squarely in one-shot (few-data) visual imitation learning: learning a policy directly from demonstrations.
The paper mentions two major challenges of imitation learning: 1) compounding errors (which this work does not address), and 2) the need for plenty of data. Prior efforts to reduce the data requirement resort to inverse RL, which can infer a reward function from a few demos. Here the authors instead start from meta-learning, which compensates for the lack of data by reusing experience from similar (more precisely, transferable) tasks.
Under the meta-learning framing, the problem is re-formulated as: how do we learn a policy (from meta-train tasks) that can quickly adapt to new tasks (meta-test tasks)? Quick adaptation is required precisely because we have only very few samples (demos, of course) of each new task.
MAML for imitation learning
Suppose we have many tasks drawn from a task distribution $P(\mathcal{T})$, where each task is defined by a triplet (expert demonstrations, loss function, task description) that follows the imitation learning setting.
The meta-train dataset is composed of sampled tasks used for meta-learning; the meta-test dataset is used to evaluate how well the model adapts to new tasks.
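To make this setup concrete, here is a minimal sketch of how a task and the meta-train/meta-test split might be represented. The field names (`demos`, `loss_fn`, `description`) and the helper `split_tasks` are my own placeholders for illustration, not the paper's code:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence
import random

@dataclass
class Task:
    """One imitation task: a few expert demos plus a loss for behavior cloning."""
    demos: List[dict]      # each demo, e.g. {"observations": ..., "actions": ...}
    loss_fn: Callable      # e.g. squared error between predicted and expert actions
    description: str       # task description (or a raw video at meta-test time)

def split_tasks(all_tasks: Sequence[Task], n_test: int, seed: int = 0):
    """Hold out n_test tasks as the meta-test set; the rest form the meta-train set."""
    tasks = list(all_tasks)
    random.Random(seed).shuffle(tasks)
    return tasks[n_test:], tasks[:n_test]   # (meta_train, meta_test)
```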
MAML (Model-Agnostic Meta-Learning) is straightforward if I interpret it as finding policy parameters that achieve maximal overall performance after an ordinary gradient update on each meta-train task, using that task's own loss function.
This follows exactly the original intent of fast adaptation to new tasks: the objective function itself is an expectation of post-adaptation performance across meta-train tasks, and updating the parameters with its gradients means this objective is directly being pursued.
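Written out in the MAML paper's notation (inner step size $\alpha$, task loss $\mathcal{L}_{\mathcal{T}_i}$, policy $f_\theta$), the meta-objective is:

$$
\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad
\min_\theta \sum_{\mathcal{T}_i \sim P(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta_i'}\bigr)
= \min_\theta \sum_{\mathcal{T}_i \sim P(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\bigr)
$$

The inner gradient step is the "normal gradient update for each meta-train task" mentioned above, and the outer minimization over $\theta$ is the search for the shared initialization.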
To break this approach apart more precisely (and clearly), there are actually three steps: 1) pretend we perform fast adaptation to some unknown tasks, 2) sum the performance of these imaginary fast-adapted solutions, 3) optimize this overall performance to find the (optimal) starting point from which to begin fast adaptation. A toy sketch of the three steps follows below.
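Here is a minimal, self-contained sketch of those three steps in Python with JAX, using a toy linear policy and a squared-error imitation loss of my own choosing; names such as `inner_lr`, `task_loss`, `adapt`, and `meta_step` are placeholders for illustration, not the paper's actual vision network or code:

```python
import jax
import jax.numpy as jnp

inner_lr, outer_lr = 0.1, 0.01   # assumed step sizes, for illustration only

def task_loss(theta, batch):
    """Toy imitation loss: squared error of a linear policy against expert actions."""
    obs, expert_actions = batch
    return jnp.mean((obs @ theta - expert_actions) ** 2)

def adapt(theta, support_batch):
    """Step 1: 'imaginary' fast adaptation -- one inner gradient step on a task's demo."""
    grads = jax.grad(task_loss)(theta, support_batch)
    return theta - inner_lr * grads

def meta_loss(theta, tasks):
    """Step 2: sum the post-adaptation performance over the sampled meta-train tasks."""
    losses = [task_loss(adapt(theta, support), query) for support, query in tasks]
    return sum(losses)

def meta_step(theta, tasks):
    """Step 3: optimize the summed post-adaptation loss w.r.t. the starting point theta."""
    grads = jax.grad(meta_loss)(theta, tasks)
    return theta - outer_lr * grads

# Tiny usage example: two tasks, each with a support demo (inner) and a query demo (outer).
key = jax.random.PRNGKey(0)
theta = jnp.zeros(3)
tasks = []
for _ in range(2):
    k1, k2, key = jax.random.split(key, 3)
    obs = jax.random.normal(k1, (5, 3))
    w_true = jax.random.normal(k2, (3,))
    tasks.append(((obs, obs @ w_true), (obs, obs @ w_true)))
theta = meta_step(theta, tasks)
```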
Notice there are two gradient update steps in the approach: one for the inner loss and another for the outer. Since this is what the meta-learning process consists of, at least two samples are needed per update: one demonstration drives the inner adaptation step, and a separate one evaluates the adapted policy in the outer loss. So in the meta-train dataset, each task should come with at least two demonstrations.
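In code terms (continuing the toy sketch above), the per-task demo split might look like this; `make_support_query` is a hypothetical helper, not something from the paper:

```python
def make_support_query(demos):
    """Each meta-train task needs >= 2 demos: one for the inner adaptation step,
    and a held-out one to score the adapted policy in the outer loss."""
    assert len(demos) >= 2, "each meta-train task needs at least two demonstrations"
    support, query = demos[0], demos[1]
    return support, query
```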
Besides, although the original problem setting claims to take only raw videos as input, this really refers to the testing stage (meta-test); it is not true of the meta-train dataset. I think it is safer to let the robot at least experience what the correct actions should be during meta-training, rather than somehow extracting them from raw videos, which could well introduce errors.
A messy point of MAML is that we estimate the expected performance of fast adaptation on the basis of the initial parameters: the outer objective only measures how well a single gradient step away from the shared initialization performs, which is a rather local proxy for adaptability.