Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behavior
author
Berkeley (Sergey Levine's group)
aim
Our aim is to recover a user’s implicit beliefs about the dynamics of the world from observations of how they act to perform a set of tasks.
insight
Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user’s internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent.
A person's demonstrated behavior often deviates from the behavior that would optimally achieve their goal; this paper aims to infer the true goal from observations of such suboptimal behavior.
hypothesis
- The user’s internal dynamics model is stationary, which is reasonable for problems like robotic teleoperation when the user has some experience practicing with the system but still finds it unintuitive or difficult to control.
- The real dynamics are known ex-ante or learned separately.
- Our core assumption is that the user’s policy is near-optimal with respect to the unknown internal dynamics model. (The behavior may be suboptimal in the real world, but it is not arbitrary.)
contribution
The main contribution of this paper is a new algorithm for intent inference that first estimates a user’s internal beliefs of the dynamics of the world using observations of how they act to perform known tasks, then leverages the learned internal dynamics model to infer intent on unknown tasks.
contents
We split up the problem of intent inference into two parts: learning the internal dynamics model from user demonstrations on known tasks (the topic of this section), and using the learned internal model to infer intent on unknown tasks (discussed later in Section 4).
Internal Dynamics Model Estimation
The key idea behind our algorithm is that we can fit a parametric model of the internal dynamics model T that maximizes the likelihood of observed action demonstrations on a set of training tasks with known rewards by using the soft Q function as an intermediary.
The goal of the paper is to model the internal dynamics model; the approach taken here is to fit a parametric model of it.
-
Inverse Soft Q-Learning
-
What do we need, and how do we tie it together?
observed actions
known rewards
We tie the soft Q function to action likelihoods using Equation 1, which encourages the soft Q function to explain the observed actions on training tasks with known rewards.
Equation 1 notation: T is the internal dynamics model; Q is the soft Q function.
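Equation 1 itself is not reproduced in these notes; in soft Q-learning it is the Boltzmann action distribution π(a|s) = exp Q(s,a) / Σ_{a'∈A} exp Q(s,a') (reconstructed here from context, not copied from the paper). A minimal numpy sketch, with an illustrative function name:

```python
import numpy as np

def action_likelihoods(Q_s):
    """Equation 1: Boltzmann policy pi(a|s) from soft Q-values at one state.

    Q_s: shape (|A|,) array of soft Q-values Q(s, a), one entry per action a.
    Returns pi(a|s) = exp(Q(s,a)) / sum_a' exp(Q(s,a')).
    """
    z = np.exp(Q_s - Q_s.max())  # subtract the max for numerical stability
    return z / z.sum()

# A higher Q-value means a higher likelihood for that action under the policy.
pi = action_likelihoods(np.array([1.0, 2.0, 3.0]))
```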
We tie the internal dynamics T to the soft Q function via the soft Bellman equation, which ensures that the soft Q function is induced by the internal dynamics T.
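Written out (reconstructed in the standard soft Q-learning form with discount γ; not verbatim from the paper), the soft Bellman equation for task i under the internal dynamics T is:

```latex
Q_i(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\!\left[ R_i(s, a, s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q_i(s', a') \right]
```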
-
-
Formulating the optimization problem
- action space: A
- known reward function for task i: R_i(s, a, s')
- unknown internal dynamics: T
- unknown soft Q function for task i: Q_i
- a function approximator for Q_i: Q_θi
- a function approximator for T: T_φ
Rewrite the soft Bellman equation above in the equivalent soft Bellman error form.
Fit the parameters φ and θi by solving a constrained optimization problem:
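The constrained problem can be reconstructed as follows (the Bellman-error symbol δ and the exact layout are my reconstruction, following the soft Bellman equation above):

```latex
\max_{\phi,\ \theta_1, \ldots, \theta_n}\ \sum_{i} \sum_{(s,a) \in \mathcal{D}^{\mathrm{demo}}_i} \log \pi_{\theta_i}(a \mid s)
\quad \text{s.t.} \quad
\delta_{\phi,\theta_i}(s,a) \triangleq Q_{\theta_i}(s,a) - \mathbb{E}_{s' \sim T_\phi(\cdot \mid s,a)}\!\left[ R_i(s,a,s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q_{\theta_i}(s',a') \right] = 0 \quad \forall\, i, s, a
```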
where D_i^demo are the demonstrations for task i, and π_θi denotes the action likelihood given by Q_θi and Equation 1.
-
Solving the optimization problem
Use the penalty method to convert the constrained optimization problem into an unconstrained one:
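As a concrete sketch, the unconstrained objective for a single task in a tabular MDP might look like the following: negative log-likelihood of the demonstrations plus a squared soft-Bellman-error penalty. All names and the tabular setting are illustrative, not from the paper:

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def penalty_objective(Q, T, R, demos, gamma=0.99, lam=1.0):
    """Penalty-method objective: -log-likelihood + lam * sum of squared Bellman errors.

    Q: (S, A) soft Q-values; T: (S, A, S) internal dynamics (rows sum to 1);
    R: (S, A, S) known rewards; demos: list of observed (s, a) index pairs.
    """
    # Equation 1 in log form: log pi(a|s) = Q(s,a) - logsumexp_a' Q(s,a')
    log_pi = Q - logsumexp(Q, axis=1)[:, None]
    nll = -sum(log_pi[s, a] for s, a in demos)
    # Soft Bellman error under the internal dynamics T
    V = logsumexp(Q, axis=1)                                   # (S,) soft state value
    backup = (T * (R + gamma * V[None, None, :])).sum(axis=2)  # (S, A) soft backup
    delta = Q - backup
    return nll + lam * np.sum(delta ** 2)
```

Minimizing this jointly over Q (playing the role of Q_θi) and T (playing the role of T_φ) approaches the constrained solution as the penalty weight lam grows.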
Regularizing the Internal Dynamics Model
The training procedure above can yield several internal models that fit the demonstrations equally well; how do we pick the best one? Two regularization approaches are introduced:
-
Multiple training tasks
Use demonstration data from multiple training tasks: the internal dynamics model must jointly explain behavior on all of them, which narrows the set of feasible models.
-
Action intent prior
Impose a reachability prior relating the real dynamics to the fitted internal dynamics: what the user intends to accomplish with an action under their internal model should also be achievable under the real dynamics.