DARLA: Improving Zero-Shot Transfer in Reinforcement Learning (Reading Notes)


Tags: paper notes, reinforcement learning algorithms


This paper is about transferring a reinforcement learning agent across different input distributions without any retraining (zero-shot domain adaptation); it does not propose any change to the underlying RL algorithms themselves.

Motivation and significance

The authors' motivation: RL agents end up being deployed on many different input distributions, yet online learning in a new environment is very hard, and collecting the data it requires is a slow process.
The most common transfer settings are:
(1) simulation -> real environment; (2) one real environment -> another real environment.
The authors therefore propose a multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent).
It first extracts features with a neural network (a disentangled representation of the observed environment), and then performs policy learning on top of that representation.

We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act.
This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation.
We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain.
A policy is considered robust if it generalises with minimal drop in performance to the target domain without extra fine-tuning.

The authors then list what goes wrong without this kind of transfer:
(1) acquiring target-domain data is too expensive;
(2) policies easily overfit to the source domain.

  1. In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance.
  2. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance

The authors therefore want a representation learning method that captures the underlying low-dimensional factors of the world, factors that do not change with the task or the data distribution.

  1. We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific.
  2. We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA, a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines.
  3. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors.

The DARLA algorithm has three stages: (1) learning the representation; (2) learning the policy; (3) transfer.

DARLA does not require target domain data to form its representations. Our approach utilises a three stage pipeline: 1) learning to see, 2) learning to act, 3) transfer.

Training and deployment domains (source domain and target domain)

This transfer setting has two defining properties:
(1) the source (training) and target (test) data distributions differ substantially;
(2) once training on the source domain is finished, no further learning takes place in the target domain.

The source-domain and target-domain data are related as follows (a toy sketch is given below):
(1) the action space is shared;
(2) the transition and reward functions are similar;
(3) the state (observation) spaces differ considerably.
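To make this setup concrete, here is a hypothetical toy example of my own (not from the paper, whose experiments use DeepMind Lab and a simulated/real Jaco arm): two environments share the action space, dynamics and reward, and differ only in how the underlying state is rendered into observations.

```python
import numpy as np

class LineWorld:
    """1-D toy world: the agent moves left/right and is rewarded at position 0.
    Source and target share actions, transitions and reward; they differ only
    in the observation function (a stand-in for e.g. simulated vs. real images)."""

    def __init__(self, render_fn, size=10):
        self.size = size
        self.render_fn = render_fn          # the only thing that changes per domain
        self.pos = size // 2

    def reset(self):
        self.pos = self.size // 2
        return self.render_fn(self.pos, self.size)

    def step(self, action):                 # shared action space: 0 = left, 1 = right
        self.pos = int(np.clip(self.pos + (1 if action == 1 else -1), 0, self.size - 1))
        reward = 1.0 if self.pos == 0 else 0.0   # shared reward function
        return self.render_fn(self.pos, self.size), reward

def render_source(pos, size):               # source domain: one-hot "pixels"
    obs = np.zeros(size)
    obs[pos] = 1.0
    return obs

def render_target(pos, size):               # target domain: same state, very different rendering
    obs = np.full(size, 0.5)
    obs[pos] = 0.0
    return obs

source_env = LineWorld(render_source)
target_env = LineWorld(render_target)       # a policy trained on source_env is evaluated here, frozen
```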

Algorithm details

The algorithm first maps the high-dimensional observations $S^o_i$ to low-dimensional latent states $S^z_i$, and it does so with unsupervised learning.

  1. In the process of doing so, the agent implicitly learns a function $F: S^o_i \to S^z_i$ that maps the typically high-dimensional raw observations $S^o_i$ to typically low-dimensional latent states $S^z_i$; followed by a policy function $\pi_i: S^z_i \to A_i$ that maps the latent states $S^z_i$ to actions $a_i$.
  2. Such a source policy $\pi_S$ is likely to be based on an entangled latent state space $S^z_S$.
  3. Hence, DARLA is based on the idea that a good quality $F$ learnt exclusively on the source domain $D_S \in \mathcal{M}$ will zero-shot generalise to all target domains $D_i \in \mathcal{M}$, and therefore the source policy $\pi(a|s^z_S;\theta)$ will also generalise to all target domains $D_i \in \mathcal{M}$ out of the box.

The algorithm has three parts (a rough outline in code is given below):
(1) learning the feature representation, which is the key part of the paper and uses unsupervised learning;
(2) feeding that representation into a standard RL algorithm (DQN, DDPG, A3C);
(3) transferring from the source domain to the target domain.
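As a rough outline of the three-stage pipeline (my own sketch; the function names and bodies are placeholders, not the authors' code):

```python
# Sketch of DARLA's three-stage pipeline. The function names are hypothetical
# placeholders; the real components are a beta-VAE, a standard RL algorithm
# (DQN / DDPG / A3C), and a zero-shot evaluation in the target domain.

def learn_to_see(source_observations):
    """Stage 1: fit an unsupervised beta-VAE on source-domain frames only and
    return the frozen encoder F: raw observation -> disentangled latent s_z."""
    ...

def learn_to_act(encoder, source_env):
    """Stage 2: train a policy pi(a | s_z) with an off-the-shelf RL algorithm,
    feeding it encoder(observation) instead of raw pixels; the encoder stays frozen."""
    ...

def transfer(encoder, policy, target_env):
    """Stage 3: run encoder + policy in the target domain with no fine-tuning
    and measure the zero-shot performance."""
    ...
```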

Stage (1) is therefore the crux of this paper, so let us look at how it is implemented.

$F_U$, the feature-representation network, uses β-VAE, an algorithm that automatically extracts a feature representation from raw images in an unsupervised way.

DARLA utilises β-VAE, a state-of-the-art unsupervised model for automated discovery of factorised latent representations from raw image data.

First, the loss function is defined:
$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)$$

Here $\phi$ and $\theta$ parameterise the encoder $q_\phi(z|x)$ and the decoder $p_\theta(x|z)$ respectively, $\beta > 1$ is a hyperparameter, and $x$ and $z$ denote the raw input and its latent code. The objective above is maximised (equivalently, its negative is minimised as the training loss). In DARLA the pixel-space reconstruction term is replaced by a perceptual one: the reconstruction $\hat{x} \sim p_\theta(x|z)$ and the input $x$ are both passed through a pretrained denoising autoencoder and compared in its feature space. Once this objective is clear, the rest of the paper is straightforward.
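As a minimal sketch (my own PyTorch code, not the authors'; it scores the reconstruction in pixel space with a Bernoulli likelihood instead of through the pretrained denoising autoencoder), the per-batch loss could be computed like this:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """beta-VAE objective to minimise: reconstruction term + beta * KL term.

    x          : input batch, values in [0, 1]
    x_hat      : decoder output (reconstruction of x, sigmoid-activated)
    mu, logvar : parameters of the diagonal-Gaussian encoder q_phi(z|x)
    beta       : > 1 encourages disentangled latents (4.0 is an illustrative default)
    """
    # Reconstruction term: negative Bernoulli log-likelihood in pixel space.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum") / x.size(0)
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```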

After that, the latent code $z$ is simply fed into the RL algorithm.
Next, let us illustrate in code how the β-VAE is trained.
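Below is a minimal training-loop sketch under the same assumptions as above: a tiny fully-connected encoder/decoder on flattened 64x64 frames stands in for the paper's convolutional architecture, and `source_frames` is a hypothetical batch of source-domain observations. It reuses the `beta_vae_loss` function from the previous snippet.

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Tiny fully-connected beta-VAE for flattened 64x64 grayscale frames
    (a stand-in for the convolutional architecture used in the paper)."""
    def __init__(self, obs_dim=64 * 64, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, z_dim), nn.Linear(512, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, obs_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.dec(z), mu, logvar

model = BetaVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# source_frames: tensor of source-domain observations, shape (N, 64*64); random data here.
source_frames = torch.rand(1024, 64 * 64)
for epoch in range(10):
    for i in range(0, len(source_frames), 64):
        x = source_frames[i:i + 64]
        x_hat, mu, logvar = model(x)
        loss = beta_vae_loss(x, x_hat, mu, logvar, beta=4.0)   # defined in the previous snippet
        opt.zero_grad(); loss.backward(); opt.step()

# After stage 1, the frozen encoder mean is used as the latent state s_z for the RL stage:
with torch.no_grad():
    s_z = model.mu(model.enc(source_frames[:1]))   # this s_z is what the policy sees
```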
