Learning Latent Dynamics for Planning from Pixels

JasonSparrow_1

Article link: https://blog.csdn.net/JasonSparrow_1/article/details/92811546

Abstract

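The Deep Planning Network (PlaNet) is a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space.
In short, PlaNet plans directly from image observations.
To perform well, the dynamics model must accurately predict rewards many time steps ahead.
The transition model uses both a deterministic and a stochastic component.
Using only pixels as input, it solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, and it needs far fewer episodes than model-free methods while reaching comparable, sometimes higher, final performance.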

Introduction

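PlaNet solves a range of tasks from the DeepMind control suite, with final performance well above A3C and, on some tasks, above D4PG.
Having both deterministic and stochastic transition components is crucial for high planning performance.
Latent overshooting generalizes the standard variational bound to include multi-step predictions. Because it uses only terms in latent space, it gives a fast regularizer that improves long-term predictions and is compatible with any latent sequence model.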

Latent Space Planning

Problem setup

Because individual observations do not reveal the full state of the environment, the problem is formulated as a partially observable Markov decision process (POMDP).
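For reference, a POMDP of this kind is usually written with the components below. The notation is an assumption on my part (hidden state s_t, action a_t, image observation o_t, scalar reward r_t, discrete time step t), so it should be checked against the paper before being relied on.

```latex
\begin{aligned}
\text{Transition function:}  \quad & s_t \sim p(s_t \mid s_{t-1}, a_{t-1}) \\
\text{Observation function:} \quad & o_t \sim p(o_t \mid s_t) \\
\text{Reward function:}      \quad & r_t \sim p(r_t \mid s_t) \\
\text{Policy:}               \quad & a_t \sim p(a_t \mid o_{\le t}, a_{<t}) \\
\text{Goal:}                 \quad & \max \; \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r_t\Big]
\end{aligned}
```

The policy conditions on the whole history of observations and actions because a single observation is not enough to recover s_t.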

Model-based planning

We use model-predictive control (MPC; Richards, 2005) to allow the agent to adapt its plan based on new observations, meaning we replan at each step. In contrast to model-free and hybrid reinforcement learning algorithms, we do not use a policy or value network.

Model-predictive control (MPC) lets the agent adapt its plan to each new observation, i.e. it replans at every step.
Unlike model-free methods, no policy or value network is used.
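The replanning loop can be sketched as follows. This is only a toy illustration, not the paper's code: `toy_model`, `plan`, `run_mpc`, and the 1-D point dynamics are hypothetical, and the exhaustive search stands in for the real planner purely to keep the example short.

```python
import itertools

def toy_model(state, action):
    """Hypothetical known dynamics: a 1-D point that moves by `action`."""
    next_state = state + action
    reward = -abs(next_state)  # reward is highest at the origin
    return next_state, reward

def plan(state, horizon=3, candidates=(-1.0, 0.0, 1.0)):
    """Return the first action of the best action sequence under the model.

    Exhaustive search over a tiny action set (an assumption made for brevity).
    """
    best_return, best_first = float("-inf"), 0.0
    for sequence in itertools.product(candidates, repeat=horizon):
        s, total = state, 0.0
        for action in sequence:
            s, r = toy_model(s, action)
            total += r
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first

def run_mpc(initial_state, steps=5):
    """Replan from scratch at every step and execute only the first action."""
    state, trajectory = initial_state, [initial_state]
    for _ in range(steps):
        action = plan(state)                 # new plan from the latest state
        state, _ = toy_model(state, action)  # here the "real" env equals the model
        trajectory.append(state)
    return trajectory

trajectory = run_mpc(3.0)
print(trajectory)
```

Only the first action of each plan is executed; the rest of the sequence is discarded and recomputed once the next state is known, which is what makes this MPC rather than open-loop control.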

Experience collection

Starting from a small amount of S seed episodes collected under random actions, we train the model and add one additional episode to the data set every C update steps.
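The collection schedule can be sketched like this. Everything here is a hypothetical stand-in: the 1-D environment, `random_action`, `planner_action`, and the default values of `seed_episodes` (S) and `collect_interval` (C) are illustrative, and the model update itself is left out.

```python
import random

def collect_episode(act, length=10):
    """Roll out a hypothetical 1-D environment; returns (state, action, reward) tuples."""
    state, episode = 0.5, []
    for _ in range(length):
        action = act(state)
        state = state + action
        episode.append((state, action, -abs(state)))
    return episode

def random_action(_state):
    """Random actions, used for the seed episodes."""
    return random.uniform(-1.0, 1.0)

def planner_action(state):
    """Placeholder for the planner: step toward the origin, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -state))

def train(seed_episodes=5, collect_interval=100, total_updates=300):
    # S seed episodes collected under random actions
    dataset = [collect_episode(random_action) for _ in range(seed_episodes)]
    for update in range(1, total_updates + 1):
        batch = random.choice(dataset)  # stand-in for sampling a training batch
        _ = batch                       # a real model update would consume it here
        if update % collect_interval == 0:
            # every C update steps, add one more episode to the data set
            dataset.append(collect_episode(planner_action))
    return dataset

random.seed(0)
dataset = train()
print(len(dataset))
```

With these illustrative defaults the data set grows from 5 to 8 episodes over 300 updates, so most of the training signal early on comes from the random seed data.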

Planning algorithm

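The cross-entropy method (CEM) is used to search for the best action sequence under the model.
Importantly, after the next observation is received, the belief over action sequences starts again from zero mean and unit variance to avoid local optima.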
Because the reward is modeled as a function of the latent state, the planner can operate purely in latent space without generating images, which allows for fast evaluation of large batches of action sequences.
In other words, the planner evaluates action sequences purely in latent space, which is what makes fast evaluation possible.
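A minimal NumPy sketch of CEM planning is given below. It is an assumption-laden illustration rather than the authors' implementation: `cem_plan`, `toy_objective`, and all default values are hypothetical, and the objective is a toy function standing in for the summed rewards predicted by the latent model.

```python
import numpy as np

def cem_plan(objective, horizon, action_dim, iterations=10,
             candidates=1000, top_k=100, seed=0):
    """Cross-entropy method over action sequences.

    The belief is a diagonal Gaussian per time step. It always starts from
    zero mean and unit variance, so every call (every replanning step)
    begins from scratch.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        noise = rng.standard_normal((candidates, horizon, action_dim))
        actions = mean + std * noise                     # sample candidate sequences
        returns = objective(actions)                     # shape: (candidates,)
        elite = actions[np.argsort(returns)[-top_k:]]    # keep the best top_k
        mean, std = elite.mean(axis=0), elite.std(axis=0)  # refit the belief
    return mean

def toy_objective(actions):
    """Hypothetical stand-in for model-predicted return: best when every action is 0.5."""
    return -np.sum((actions - 0.5) ** 2, axis=(1, 2))

best_sequence = cem_plan(toy_objective, horizon=12, action_dim=1)
first_action = best_sequence[0]  # under MPC only this action is executed
print(first_action)
```

Because the objective receives the whole batch of candidate sequences at once, it can be evaluated in a single vectorized pass, which mirrors the point about evaluating large batches in latent space.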

Recurrent State Space Model

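We use a recurrent state-space model (RSSM) that can predict forward purely in latent space, similar to recently proposed models. (It can be viewed as a nonlinear Kalman filter or a sequential VAE. A VAE, like a GAN, is a generative model, but instead of learning to generate samples directly, it explicitly models the latent features and learns their distribution.)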

Latent dynamics

A typical latent state-space model.

Variational encoder

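As a reminder of what the encoder does, the filtering posterior and the one-step variational bound can be written as below. This is reconstructed from memory of the paper, using the assumed symbols s_t (latent state), o_t (observation), and a_t (action), so the exact conditioning should be verified against the original.

```latex
\begin{aligned}
q(s_{1:T} \mid o_{1:T}, a_{1:T})
  &= \prod_{t=1}^{T} q(s_t \mid s_{t-1}, a_{t-1}, o_t) \\
\ln p(o_{1:T} \mid a_{1:T})
  &\ge \sum_{t=1}^{T} \Big(
      \mathbb{E}_{q(s_t \mid o_{\le t}, a_{<t})}\big[\ln p(o_t \mid s_t)\big]
      - \mathbb{E}_{q(s_{t-1} \mid o_{\le t-1}, a_{<t-1})}\big[
          \mathrm{KL}\big[\, q(s_t \mid o_{\le t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1}) \big]
        \big]
    \Big)
\end{aligned}
```

The first term rewards reconstructing the observation; the KL term keeps the posterior close to what the transition model predicts one step ahead.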

Deterministic path

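Purely stochastic transitions make it hard for the model to remember information over many time steps. In theory the model could learn to set the variance to zero for some state components, but in practice optimization may not find this solution.
A sequence of deterministic activation vectors is therefore added, which lets the model access not just the last state but all previous states.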
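The split into a deterministic and a stochastic path can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single tanh layer replaces whatever recurrent cell the paper uses, the weights are random rather than trained, and `init_params`, `rssm_step`, and `rollout` are hypothetical names.

```python
import numpy as np

def init_params(h_dim, s_dim, a_dim, rng):
    """Random, untrained weights (illustrative only)."""
    in_dim = h_dim + s_dim + a_dim
    return {
        "W_h": rng.normal(0.0, 0.3, (h_dim, in_dim)),
        "W_mean": rng.normal(0.0, 0.3, (s_dim, h_dim)),
        "W_std": rng.normal(0.0, 0.3, (s_dim, h_dim)),
    }

def rssm_step(h_prev, s_prev, a_prev, params, rng):
    """One prior step: h_t = f(h_{t-1}, s_{t-1}, a_{t-1}), then s_t ~ p(s_t | h_t)."""
    x = np.concatenate([h_prev, s_prev, a_prev])
    h = np.tanh(params["W_h"] @ x)                     # deterministic path
    mean = params["W_mean"] @ h
    std = np.log1p(np.exp(params["W_std"] @ h)) + 0.1  # softplus keeps std positive
    s = mean + std * rng.standard_normal(mean.shape)   # stochastic path
    return h, s

def rollout(actions, params, h_dim, s_dim, seed):
    """Predict forward purely in latent space for a given action sequence."""
    rng = np.random.default_rng(seed)
    h, s = np.zeros(h_dim), np.zeros(s_dim)
    hs, ss = [], []
    for a in actions:
        h, s = rssm_step(h, s, a, params, rng)
        hs.append(h)
        ss.append(s)
    return np.array(hs), np.array(ss)

params = init_params(h_dim=8, s_dim=4, a_dim=2, rng=np.random.default_rng(0))
actions = np.ones((5, 2))
hs_a, ss_a = rollout(actions, params, 8, 4, seed=1)
hs_b, ss_b = rollout(actions, params, 8, 4, seed=2)
```

Two rollouts with different noise seeds share the same first deterministic state, since it depends only on the initial zeros and the action, but their stochastic states differ, and from the second step on the deterministic states diverge too because h_t takes the sampled s_{t-1} as input.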

Latent Overshooting


Limited capacity

With a model of limited capacity and a restricted distributional family, the model that is best at one-step prediction does not in general coincide with the model that is best at multi-step prediction.

Multi-step prediction

The variational bound is first generalized from one-step predictions to multi-step predictions at a fixed distance d.

Latent overshooting

The objective should train multi-step predictions for all distances 1 ≤ d ≤ D, not just for a single fixed distance.
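The KL part of such an objective can be sketched as below. This is a rough illustration, not the paper's loss: `kl_diag_gauss`, `latent_overshooting_kl`, and `toy_transition` are hypothetical, all distances are weighted equally (any per-distance weights are omitted), actions are left out, and the expectation over the multi-step prior is estimated with a few Monte Carlo samples.

```python
import numpy as np

def kl_diag_gauss(mu_q, std_q, mu_p, std_p):
    """KL[N(mu_q, std_q^2) || N(mu_p, std_p^2)], summed over the last axis."""
    var_ratio = (std_q / std_p) ** 2
    mean_term = ((mu_q - mu_p) / std_p) ** 2
    return 0.5 * np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio), axis=-1)

def toy_transition(s):
    """Hypothetical learned prior p(s_t | s_{t-1}): returns (mean, std)."""
    return 0.9 * s, np.full_like(s, 0.5)

def latent_overshooting_kl(post_mu, post_std, transition, max_distance, rng, samples=64):
    """Average over d = 1..D of the d-step KL terms, each summed over time."""
    T = post_mu.shape[0]
    total = 0.0
    for d in range(1, max_distance + 1):
        for t in range(d, T):
            # Sample s_{t-d} from the posterior, then roll the prior forward d-1 steps.
            shape = (samples,) + post_mu[t - d].shape
            s = post_mu[t - d] + post_std[t - d] * rng.standard_normal(shape)
            for _ in range(d - 1):
                mu, std = transition(s)
                s = mu + std * rng.standard_normal(s.shape)
            mu_p, std_p = transition(s)  # d-step prediction of s_t
            total += kl_diag_gauss(post_mu[t], post_std[t], mu_p, std_p).mean()
    return total / max_distance

rng = np.random.default_rng(0)
post_mu = rng.normal(size=(6, 3))   # posterior means for T = 6 steps, 3 latent dims
post_std = np.full((6, 3), 0.5)
one_step = latent_overshooting_kl(post_mu, post_std, toy_transition, 1, rng)
overshoot = latent_overshooting_kl(post_mu, post_std, toy_transition, 3, rng)
```

With `max_distance=1` this reduces to the usual one-step KL regularizer; larger values also penalize the posterior for disagreeing with predictions made from states further in the past, all without decoding any images.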

Experiments

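PlaNet is evaluated on six continuous control tasks from pixels.
Several design axes are examined:

deterministic and stochastic paths in the dynamics model;
iterative planning;
online experience collection.
Apart from the action repeat, the same hyperparameters are used for all tasks.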

Comparison with model-free methods:

[Figure: comparison of PlaNet with model-free methods]

Model design

The deterministic part allows the model to remember information over many time steps.
The stochastic component is even more important – the agent does not learn without it.