Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behavior
author
Berkeley (Sergey Levine's group)
aim
Our aim is to recover a user’s implicit beliefs about the dynamics of the world from observations of how they act to perform a set of tasks.
insight
Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user’s internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent.
A person's demonstrated behavior often deviates from the behavior that would optimally achieve their goal; this paper aims to infer the true goal from observations of such suboptimal behavior.
hypothesis
- The user’s internal dynamics model is stationary, which is reasonable for problems like robotic teleoperation when the user has some experience practicing with the system but still finds it unintuitive or difficult to control.
- The real dynamics are known ex-ante or learned separately.
- Our core assumption is that the user’s policy is near-optimal with respect to the unknown internal dynamics model. (The behavior may be suboptimal in the real world, but it is not arbitrary.)
contribution
The main contribution of this paper is a new algorithm for intent inference that first estimates a user’s internal beliefs of the dynamics of the world using observations of how they act to perform known tasks, then leverages the learned internal dynamics model to infer intent on unknown tasks.
contents
We split up the problem of intent inference into two parts: learning the internal dynamics model from user demonstrations on known tasks (the topic of this section), and using the learned internal model to infer intent on unknown tasks (discussed later in Section 4).
Internal Dynamics Model Estimation
The key idea behind our algorithm is that we can fit a parametric model of the internal dynamics model T that maximizes the likelihood of observed action demonstrations on a set of training tasks with known rewards by using the soft Q function as an intermediary.
The goal of the paper is to model the internal dynamics model; the approach taken here is to fit a parametric model of it.
-
Inverse Soft Q-Learning
-
What do we need, and how do we tie it together?
observed actions
known rewards
We tie the soft Q function to action likelihoods using Equation 1, which encourages the soft Q function to explain the observed actions on training tasks with known rewards.
Equation 1 notation: T is the internal dynamics model; Q is the soft Q function.
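Equation 1 itself is not reproduced in these notes; in soft Q-learning it is the Boltzmann action distribution π(a|s) = exp Q(s,a) / Σ_{a'∈A} exp Q(s,a') (reconstructed here from context, not copied from the paper). A minimal numpy sketch, with an illustrative function name:

```python
import numpy as np

def action_likelihoods(Q_s):
    """Equation 1: Boltzmann policy pi(a|s) from soft Q-values at one state.

    Q_s: shape (|A|,) array of soft Q-values Q(s, a), one entry per action a.
    Returns pi(a|s) = exp(Q(s,a)) / sum_a' exp(Q(s,a')).
    """
    z = np.exp(Q_s - Q_s.max())  # subtract the max for numerical stability
    return z / z.sum()

# A higher Q-value means a higher likelihood for that action under the policy.
pi = action_likelihoods(np.array([1.0, 2.0, 3.0]))
```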
We tie the internal dynamics T to the soft Q function via the soft Bellman equation, which ensures that the soft Q function is induced by the internal dynamics T.
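Written out (reconstructed in the standard soft Q-learning form with discount γ; not verbatim from the paper), the soft Bellman equation for task i under the internal dynamics T is:

```latex
Q_i(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\!\left[ R_i(s, a, s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q_i(s', a') \right]
```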
-
-
Formulating the optimization problem
- action space: A
- known reward function for task i: R_i(s, a, s')
- unknown internal dynamics: T
- unknown soft Q function for task i: Q_i
- a function approximator for Q_i: Q_θi
- a function approximator for T: T_φ
Rewrite the soft Bellman equation above in the equivalent soft Bellman error form.
Fit the parameters φ and θi by solving a constrained optimization problem:
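The constrained problem can be reconstructed as follows (the Bellman-error symbol δ and the exact layout are my reconstruction, following the soft Bellman equation above):

```latex
\max_{\phi,\ \theta_1, \ldots, \theta_n}\ \sum_{i} \sum_{(s,a) \in \mathcal{D}^{\mathrm{demo}}_i} \log \pi_{\theta_i}(a \mid s)
\quad \text{s.t.} \quad
\delta_{\phi,\theta_i}(s,a) \triangleq Q_{\theta_i}(s,a) - \mathbb{E}_{s' \sim T_\phi(\cdot \mid s,a)}\!\left[ R_i(s,a,s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q_{\theta_i}(s',a') \right] = 0 \quad \forall\, i, s, a
```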
where D_i^demo are the demonstrations for task i, and π_θi denotes the action likelihood given by Q_θi and Equation 1.
-
Solving the optimization problem
Use the penalty method to convert the constrained optimization problem into an unconstrained one:
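As a concrete sketch, the unconstrained objective for a single task in a tabular MDP might look like the following: negative log-likelihood of the demonstrations plus a squared soft-Bellman-error penalty. All names and the tabular setting are illustrative, not from the paper:

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def penalty_objective(Q, T, R, demos, gamma=0.99, lam=1.0):
    """Penalty-method objective: -log-likelihood + lam * sum of squared Bellman errors.

    Q: (S, A) soft Q-values; T: (S, A, S) internal dynamics (rows sum to 1);
    R: (S, A, S) known rewards; demos: list of observed (s, a) index pairs.
    """
    # Equation 1 in log form: log pi(a|s) = Q(s,a) - logsumexp_a' Q(s,a')
    log_pi = Q - logsumexp(Q, axis=1)[:, None]
    nll = -sum(log_pi[s, a] for s, a in demos)
    # Soft Bellman error under the internal dynamics T
    V = logsumexp(Q, axis=1)                                   # (S,) soft state value
    backup = (T * (R + gamma * V[None, None, :])).sum(axis=2)  # (S, A) soft backup
    delta = Q - backup
    return nll + lam * np.sum(delta ** 2)
```

Minimizing this jointly over Q (playing the role of Q_θi) and T (playing the role of T_φ) approaches the constrained solution as the penalty weight lam grows.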
Regularizing the Internal Dynamics Model
The training procedure above can yield several internal models that fit the demonstrations equally well; how do we pick the best one? Two regularization approaches are introduced:
-
Multiple training tasks
Use demonstration data from multiple training tasks: the internal dynamics model must jointly explain behavior on all of them, which narrows the set of feasible models.
-
Action intent prior
Impose a reachability prior relating the real dynamics to the fitted internal dynamics: what the user intends to accomplish with an action under their internal model should also be achievable under the real dynamics.