Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

This post covers a deep reinforcement learning method that uses probabilistic dynamics models, aiming to improve sample efficiency and to address model overfitting. By ensembling probabilistic and deterministic neural networks, the model can capture uncertainty in both low-data and high-data regimes. These models are then used for planning and control: although computing the expected trajectory reward is challenging, trajectory-sampling strategies such as TS1 and TS∞ still enable effective state propagation.
(The summary above was auto-generated by CSDN.)

motivation

Model-based approaches enjoy 1) sample efficiency (they learn quickly) and 2) a reward-independent dynamics model (whereas model-free approaches need the reward signal to update), but they lag behind model-free approaches in asymptotic performance (they tend to converge to sub-optimal solutions).

This work is based on two observations:

  1. model capacity matters
    GPs are data-efficient but lack expressiveness, while NNs are expressive but learn slowly.
  2. the above issue can be mitigated by incorporating uncertainties.
    (actually I didn’t find any reasoning for this claim in the paper)

Talking of related work, the paper claims that the deterministic NNs used in many prior works suffer from overfitting in the early stages of learning.

The author mentions a major challenge in model-based RL: the model should perform well in both low- and high-data regimes.

Q2:

What causes this? Is this specific to the model-based RL setting?

pipeline

probabilistic ensemble dynamics model

dynamics model

  1. probabilistic NN
    a parametrized conditional distribution model $f_\theta(s_{t+1} \mid s_t, a_t)$, optimized by maximizing the likelihood of environment-produced trajectories.

    A typical choice of the distribution is a diagonal multivariate Gaussian. This is similar to models that predict actions given states in a continuous action space: the network outputs a state mean vector and a state variance vector, and the next state is produced by sampling from that Gaussian.

  2. deterministic NN
    a function $f(s_t, a_t)$ that outputs a single next-state prediction directly, trained by minimizing the MSE against environment-produced transitions.
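The probabilistic-NN item above can be sketched in code. This is a minimal NumPy illustration under my own assumptions, not the paper's implementation: the class name, the one-hidden-layer MLP, and the layer sizes are illustrative. The point is the structure — a Gaussian output head (mean plus log-variance per state dimension), next-state sampling, and the negative-log-likelihood objective that "maximizing the likelihood of environment-produced trajectories" amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)

class ProbabilisticDynamicsModel:
    """Sketch of a probabilistic NN f_theta(s_{t+1} | s_t, a_t):
    a one-hidden-layer MLP whose output parameterizes a diagonal
    Gaussian over the next state (mean and log-variance heads)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        in_dim = state_dim + action_dim
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # two heads packed together: mean and log-variance, each state_dim wide
        self.W2 = rng.normal(0.0, 0.1, (hidden, 2 * state_dim))
        self.b2 = np.zeros(2 * state_dim)
        self.state_dim = state_dim

    def forward(self, s, a):
        x = np.concatenate([s, a], axis=-1)
        h = np.tanh(x @ self.W1 + self.b1)
        out = h @ self.W2 + self.b2
        mean = out[..., : self.state_dim]
        log_var = out[..., self.state_dim :]
        return mean, log_var

    def sample_next_state(self, s, a):
        # draw s_{t+1} from the predicted diagonal Gaussian
        mean, log_var = self.forward(s, a)
        std = np.exp(0.5 * log_var)
        return mean + std * rng.normal(size=mean.shape)

def gaussian_nll(mean, log_var, target):
    """Per-transition negative log-likelihood of a diagonal Gaussian
    (constants dropped); averaging this over observed (s_t, a_t, s_{t+1})
    tuples and minimizing it is the maximum-likelihood objective."""
    return 0.5 * np.sum(log_var + (target - mean) ** 2 / np.exp(log_var), axis=-1)
```

Note that the variance head is what lets the model express (aleatoric) uncertainty: the NLL penalizes both large errors and overconfident (too-small) predicted variances.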
