Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

This post covers a deep reinforcement learning method that uses probabilistic dynamics models, aiming to improve sample efficiency and to address model overfitting. By ensembling probabilistic and deterministic neural networks, the model can capture uncertainty in both low-data and high-data regimes. These models are then used for planning and control: although computing the expected trajectory reward is challenging, trajectory-sampling strategies such as TS1 and TS∞ still enable effective state propagation.
(The summary above was auto-generated by CSDN.)

motivation

Model-based approaches enjoy 1) sample efficiency (they learn quickly) and 2) a reward-independent dynamics model (whereas model-free approaches need the reward signal to update), but they lag behind model-free approaches in asymptotic performance (they tend to converge to sub-optimal solutions).

This work is based on two observations:

  1. model capacity matters
    GPs are data-efficient but lack expressiveness, while NNs are expressive but learn slowly.
  2. the above issue can be mitigated by incorporating uncertainties.
    (actually I didn’t find any reasoning for this claim in the paper)

Talking of related work, the paper claims that the deterministic NNs used in many prior works suffer from overfitting in the early stages of learning.

The author mentions a major challenge in model-based RL: the model should perform well in both low- and high-data regimes.

Q2:

What causes this? Is this specific to the model-based RL setting?

pipeline

probabilistic ensemble dynamics model

dynamics model

  1. probabilistic NN
    a parametrized conditional distribution model $f_\theta(s_{t+1} \mid s_t, a_t)$, optimized by maximizing the likelihood of environment-produced trajectories.

    A typical choice of the distribution is a diagonal multivariate Gaussian. This is similar to models that predict actions given states in a continuous action space: the network outputs a state mean vector and a state variance vector, and the next state is produced by sampling from that Gaussian.

  2. deterministic NN
    a function $f(s_t, a_t)$ that outputs a single next-state prediction directly, trained by minimizing the MSE against environment-produced transitions.
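The probabilistic-NN item above can be sketched in code. This is a minimal NumPy illustration under my own assumptions, not the paper's implementation: the class name, the one-hidden-layer MLP, and the layer sizes are illustrative. The point is the structure — a Gaussian output head (mean plus log-variance per state dimension), next-state sampling, and the negative-log-likelihood objective that "maximizing the likelihood of environment-produced trajectories" amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)

class ProbabilisticDynamicsModel:
    """Sketch of a probabilistic NN f_theta(s_{t+1} | s_t, a_t):
    a one-hidden-layer MLP whose output parameterizes a diagonal
    Gaussian over the next state (mean and log-variance heads)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        in_dim = state_dim + action_dim
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # two heads packed together: mean and log-variance, each state_dim wide
        self.W2 = rng.normal(0.0, 0.1, (hidden, 2 * state_dim))
        self.b2 = np.zeros(2 * state_dim)
        self.state_dim = state_dim

    def forward(self, s, a):
        x = np.concatenate([s, a], axis=-1)
        h = np.tanh(x @ self.W1 + self.b1)
        out = h @ self.W2 + self.b2
        mean = out[..., : self.state_dim]
        log_var = out[..., self.state_dim :]
        return mean, log_var

    def sample_next_state(self, s, a):
        # draw s_{t+1} from the predicted diagonal Gaussian
        mean, log_var = self.forward(s, a)
        std = np.exp(0.5 * log_var)
        return mean + std * rng.normal(size=mean.shape)

def gaussian_nll(mean, log_var, target):
    """Per-transition negative log-likelihood of a diagonal Gaussian
    (constants dropped); averaging this over observed (s_t, a_t, s_{t+1})
    tuples and minimizing it is the maximum-likelihood objective."""
    return 0.5 * np.sum(log_var + (target - mean) ** 2 / np.exp(log_var), axis=-1)
```

Note that the variance head is what lets the model express (aleatoric) uncertainty: the NLL penalizes both large errors and overconfident (too-small) predicted variances.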
