Where Do You Think You’re Going?: Inferring Beliefs about Dynamics from Behavior

author

UC Berkeley (Sergey Levine's group)

aim

Our aim is to recover a user’s implicit beliefs about the dynamics of the world from observations of how they act to perform a set of tasks.

insight

Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user’s internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent.

A person's observed behavior is likely to deviate somewhat from the behavior that would optimally achieve their goal; this paper aims to infer the true goal from such suboptimal behavior.

hypothesis

  • The user’s internal dynamics model is stationary, which is reasonable for problems like robotic teleoperation when the user has some experience practicing with the system but still finds it unintuitive or difficult to control.
  • The real dynamics are known ex-ante or learned separately.
  • Our core assumption is that the user’s policy is near-optimal with respect to the unknown internal dynamics model. (The behavior may be suboptimal in the real world, but it is not arbitrary.)

contribution

The main contribution of this paper is a new algorithm for intent inference that first estimates a user’s internal beliefs of the dynamics of the world using observations of how they act to perform known tasks, then leverages the learned internal dynamics model to infer intent on unknown tasks.

contents

We split up the problem of intent inference into two parts: learning the internal dynamics model from user demonstrations on known tasks (the topic of this section), and using the learned internal model to infer intent on unknown tasks (discussed later in Section 4).

Internal Dynamics Model Estimation

The key idea behind our algorithm is that we can fit a parametric model of the internal dynamics model T that maximizes the likelihood of observed action demonstrations on a set of training tasks with known rewards by using the soft Q function as an intermediary.

The goal of the paper is to model the internal dynamics; the approach taken here is to fit a parametric model.

  • Inverse Soft Q-Learning

    • What do we need, and how do we tie the pieces together?

      observed actions

      known rewards

      We tie the soft Q function to the observed action likelihoods using Equation 1, which encourages the soft Q function to explain the observed actions (a small numerical sketch follows Equation 2 below).

      Equation 1 (action likelihood induced by the soft Q function):

      $$\pi(a \mid s) = \frac{\exp Q(s, a)}{\sum_{a' \in \mathcal{A}} \exp Q(s, a')}$$

      T: internal dynamics model

      Q: soft Q function

      We tie the internal dynamics T to the soft Q function via the soft Bellman equation, which ensures that the soft Q function is induced by the internal dynamics T.

Equation 2 (soft Bellman equation tying T to the soft Q function):

$$Q(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[\, R(s, a, s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q(s', a') \,\right]$$
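A small numerical sketch of Equations 1 and 2 (not from the paper; the toy MDP, the sizes, and the names `soft_value_iteration` / `action_likelihoods` are made up for illustration): given a candidate internal dynamics T and a known reward R, iterate the soft Bellman backup of Equation 2 to a fixed point, then read action likelihoods off the resulting soft Q with Equation 1.

```python
import numpy as np

def soft_value_iteration(T, R, gamma=0.9, n_iters=200):
    """Soft Bellman backup (Equation 2) for a small tabular MDP.

    T: internal dynamics, shape (S, A, S), T[s, a, s2] = P(s2 | s, a)
    R: reward, shape (S, A, S), R[s, a, s2] = R(s, a, s2)
    Returns the soft Q function, shape (S, A).
    """
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # soft (log-sum-exp) state value V(s) = log sum_a exp Q(s, a)
        V = np.log(np.exp(Q).sum(axis=1))
        # Q(s, a) = E_{s' ~ T}[ R(s, a, s') + gamma * V(s') ]
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

def action_likelihoods(Q):
    """Equation 1: Boltzmann action distribution induced by the soft Q."""
    expQ = np.exp(Q - Q.max(axis=1, keepdims=True))  # numerically stable softmax
    return expQ / expQ.sum(axis=1, keepdims=True)

# Toy example: 3 states, 2 actions, a random candidate internal model.
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(3), size=(3, 2))   # shape (3, 2, 3), rows sum to 1 over s'
R = rng.normal(size=(3, 2, 3))
pi = action_likelihoods(soft_value_iteration(T, R))
print(pi)  # each row: predicted action probabilities in that state
```

In the actual method T is not fixed; it is the quantity being fit. This sketch only shows how a given T induces a soft Q function and, through it, action likelihoods.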

  • Formulating the optimization problem

    action space: A

    known reward function for task i: R_i(s, a, s')

    unknown internal dynamics: T

    unknown soft Q function for task i: Q_i

    a function approximator for Q_i: Q_{θ_i}

    a function approximator for T: T_φ

    Rewrite the soft Bellman equation above in the equivalent form of a soft Bellman error:

Equation 3 (soft Bellman error):

$$\delta_{\phi, \theta_i}(s, a) = Q_{\theta_i}(s, a) - \mathbb{E}_{s' \sim T_{\phi}(\cdot \mid s, a)}\left[\, R_i(s, a, s') + \gamma \log \sum_{a' \in \mathcal{A}} \exp Q_{\theta_i}(s', a') \,\right]$$

Fit the parameters φ and θ_i by solving a constrained optimization problem:

Equation 4 (constrained maximum-likelihood problem):

$$\min_{\phi,\, \theta_1, \ldots, \theta_n} \; \sum_{i} \sum_{(s, a) \in \mathcal{D}^{\mathrm{demo}}_i} -\log \pi_{\theta_i}(a \mid s) \quad \text{s.t.} \quad \delta_{\phi, \theta_i}(s, a) = 0 \;\; \forall\, s, a, i$$

where D_i^demo are the demonstrations for task i, and π_{θ_i} denotes the action likelihood given by Q_{θ_i} and Equation 1.

  • Solving the optimization problem

    Use the penalty method to convert the constrained optimization problem into an unconstrained one (a training sketch in code follows Equation 5):

Equation 5 (unconstrained penalized objective):

$$\min_{\phi,\, \theta_1, \ldots, \theta_n} \; \sum_{i} \left[\, \sum_{(s, a) \in \mathcal{D}^{\mathrm{demo}}_i} -\log \pi_{\theta_i}(a \mid s) \; + \; \rho \sum_{s, a} \delta_{\phi, \theta_i}(s, a)^2 \,\right]$$

where ρ > 0 is the penalty coefficient.
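A minimal sketch of how Equations 3–5 could be optimized on a small tabular problem, assuming PyTorch is available. The table parameterization, the toy demonstrations, and the name `penalized_loss` are illustrative assumptions, not the paper's implementation (which uses neural-network function approximators): each task i gets its own soft Q table Q_{θ_i}, the internal dynamics T_φ is shared across tasks, and the demonstration negative log-likelihood plus ρ times the squared soft Bellman error is minimized by gradient descent.

```python
import torch
import torch.nn.functional as F

S, A, n_tasks = 5, 3, 2
gamma, rho = 0.9, 10.0

# One soft Q table per task (Q_{theta_i}) and one shared internal dynamics model (T_phi).
Q = [torch.zeros(S, A, requires_grad=True) for _ in range(n_tasks)]
T_logits = torch.zeros(S, A, S, requires_grad=True)

# Known rewards per task and demonstrated (state, action) pairs -- toy placeholders.
R = [torch.randn(S, A, S) for _ in range(n_tasks)]
demos = [[(0, 1), (2, 0), (4, 2)], [(1, 2), (3, 1)]]

def penalized_loss():
    T = torch.softmax(T_logits, dim=2)                  # internal dynamics T_phi(s' | s, a)
    total = 0.0
    for i in range(n_tasks):
        log_pi = F.log_softmax(Q[i], dim=1)             # Equation 1 in log form
        nll = -sum(log_pi[s, a] for s, a in demos[i])   # demo negative log-likelihood
        V = torch.logsumexp(Q[i], dim=1)                # soft state value
        backup = (T * (R[i] + gamma * V[None, None, :])).sum(dim=2)
        delta = Q[i] - backup                           # Equation 3: soft Bellman error
        total = total + nll + rho * (delta ** 2).sum()  # Equation 5
    return total

opt = torch.optim.Adam(Q + [T_logits], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = penalized_loss()
    loss.backward()
    opt.step()
```

Because T_φ is shared across the per-task losses, demonstrations from every training task jointly constrain the internal dynamics model, which is also the point of the first regularization strategy below.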

Regularizing the Internal Dynamics Model

The training procedure above may admit several internal dynamics models that explain the data equally well; how do we pick the right one? Two regularization strategies are introduced below:

  • Multiple training tasks

    Fit a single internal dynamics model on demonstrations from multiple tasks, so that it must explain the behavior on all of them.

  • Action intent prior

    The real-world dynamics and the fitted internal dynamics should be related to each other through reachability: outcomes the user intends under the internal model should also be reachable under the real dynamics (Equation 6; a rough sketch follows it):

Equation 6 (action intent prior)

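Equation 6 itself is not reproduced above, so the following is only a rough, hypothetical sketch of the reachability idea described in the bullet; the functions `one_step_reachable` and `intent_consistency` are invented for illustration and are not the paper's prior. The idea: score a fitted internal model by whether the outcomes the user intends under it are actually reachable under the real dynamics.

```python
import numpy as np

def one_step_reachable(T_real, s, s_next, tol=1e-3):
    """True if some action moves the agent from s to s_next with
    non-negligible probability under the real dynamics T_real (shape S x A x S)."""
    return bool(np.any(T_real[s, :, s_next] > tol))

def intent_consistency(T_internal, T_real):
    """Fraction of (s, a) pairs whose most likely outcome under the fitted
    internal dynamics is also reachable under the real dynamics."""
    S, A, _ = T_internal.shape
    hits = 0
    for s in range(S):
        for a in range(A):
            intended = int(np.argmax(T_internal[s, a]))  # the outcome the user intends
            hits += one_step_reachable(T_real, s, intended)
    return hits / (S * A)
```
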
Experiment
