2024/2/24: Imitation Learning: Eliciting Compatible Demonstrations for Multi-Human Imitation Learning

Link to this post: https://blog.csdn.net/wdnmdwsmsa/article/details/136263996

CoRL2022 Poster
Authors: Kanishk Gandhi, Siddharth Karamcheti, Madeline Liao, Dorsa Sadigh
Keywords: Interactive Imitation Learning, Active Demonstration Elicitation, Human Robot Interaction

1. Abstract

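Imitation learning from expert demonstrations is a powerful approach for teaching robots manipulation skills. The ideal dataset for imitation learning is homogeneous and low-variance, reflecting a single optimal way of performing the task. Human behavior, however, is heterogeneous: the same task can be solved in several different ways. This paper proposes an online interactive imitation learning framework that keeps improving the policy by iteratively collecting new demonstrations.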

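To keep newly added demonstrations from being incompatible with the existing data, the work designs a method that 1) measures the compatibility of a new demonstration given a base policy, and 2) actively elicits more compatible demonstrations from new users. Robot-arm experiments show that incompatible demonstrations can be identified by post-hoc filtering, and that the compatibility measure can be used to actively elicit compatible demonstrations from new users, improving task success rates in both simulated and real environments.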

2. Method

2.1 Learning to Measure Compatibility in Multi-Human Demonstrations

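Compatibility measure $\mathcal{M}$: estimates the performance of the base policy $\pi_{\mathrm{base}}$ with respect to the base dataset $D_{\mathrm{base}}$ and a new dataset $D_{\mathrm{new}}$.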

$\mathcal{M}=\begin{cases}1-\min\left(\frac{(\pi_{\mathrm{base}}(s_{\mathrm{new}})-a_{\mathrm{new}})^2}{\lambda},1\right)&\text{if }\mathrm{novelty}(s_{\mathrm{new}})<\eta\\1&\text{otherwise.}\end{cases}$
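Here, novelty is measured as the standard deviation of the actions that $\pi_{\mathrm{base}}$ predicts for the state. By definition, the compatibility measure equals 1 when the state of the new demonstration is sufficiently novel ($\mathrm{novelty}(s_{\mathrm{new}})\geq\eta$), or when $\pi_{\mathrm{base}}(s_{\mathrm{new}})=a_{\mathrm{new}}$. It equals 0 when the state is not novel and $(\pi_{\mathrm{base}}(s_{\mathrm{new}})-a_{\mathrm{new}})^{2}\geq\lambda$; between these two cases it decreases linearly with the squared error.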
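The piecewise definition above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions that are not stated in this post: the standard deviation is taken over K predictions for the same state (for example from an ensemble of base policies), the mean of those predictions stands in for $\pi_{\mathrm{base}}(s_{\mathrm{new}})$, the squared error is the squared L2 norm for multi-dimensional actions, and the per-dimension standard deviations are averaged into one novelty score. The function name `compatibility` and the values chosen for `lam` and `eta` are hypothetical.

```python
import numpy as np


def compatibility(ensemble_actions, a_new, lam, eta):
    """Compatibility of one (s_new, a_new) pair under the base policy.

    ensemble_actions: array of shape (K, action_dim), K predictions of the
        base policy for s_new (assumed here to come from an ensemble).
    a_new: demonstrated action, shape (action_dim,).
    lam: squared-error scale (lambda in the formula).
    eta: novelty threshold.
    """
    preds = np.asarray(ensemble_actions, dtype=float)
    a_new = np.asarray(a_new, dtype=float)

    # Novelty: spread of the predictions, averaged over action dimensions
    # (the averaging is an assumption of this sketch).
    novelty = preds.std(axis=0).mean()
    if novelty >= eta:
        # Novel state: the base policy has no opinion, so nothing conflicts.
        return 1.0

    # Familiar state: penalize disagreement with the mean prediction.
    sq_err = float(np.sum((preds.mean(axis=0) - a_new) ** 2))
    return 1.0 - min(sq_err / lam, 1.0)


# Familiar state (all predictions agree), three different demonstrated actions.
familiar = [[0.0, 0.0], [0.0, 0.0]]
m_same = compatibility(familiar, [0.0, 0.0], lam=0.5, eta=0.1)
m_near = compatibility(familiar, [0.5, 0.0], lam=0.5, eta=0.1)
m_far = compatibility(familiar, [1.0, 0.0], lam=0.5, eta=0.1)

# Novel state (predictions disagree strongly): compatible by definition.
novel = [[-1.0, 0.0], [1.0, 0.0]]
m_novel = compatibility(novel, [5.0, 5.0], lam=0.5, eta=0.1)
```

With these inputs, `m_same` falls in the "identical action" case, `m_far` in the "error at least $\lambda$" case, and `m_novel` in the "otherwise" branch, while `m_near` lands on the linear part in between.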