DARLA: Improving Zero-Shot Transfer in Reinforcement Learning (Reading Notes)


Tags: paper notes, reinforcement learning algorithms


This paper is about transferring a reinforcement learning agent across different input distributions without any retraining (zero-shot domain adaptation); it does not propose any change to the underlying RL algorithms themselves.

Motivation and significance

The authors' motivation: RL agents end up being deployed on many different input distributions, yet online learning in a new environment is very hard, and collecting the data it requires is a slow process.
The most common transfer settings are:
(1) simulation -> real environment; (2) one real environment -> another real environment.
The authors therefore propose a multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent).
It first extracts features with a neural network (a disentangled representation of the observed environment), and then performs policy learning on top of that representation.

We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act.
This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation.
We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain.
A policy is considered robust if it generalises with minimal drop in performance to the target domain without extra fine-tuning.

The authors then list what goes wrong without this kind of transfer:
(1) acquiring target-domain data is too expensive;
(2) policies easily overfit to the source domain.

  1. In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance.
  2. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance

The authors therefore want a representation learning method that captures the underlying low-dimensional factors of the world, factors that do not change with the task or the data distribution.

  1. We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific.
  2. We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA, a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines.
  3. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors.

The DARLA algorithm has three stages: (1) learning the representation; (2) learning the policy; (3) transfer.

DARLA does not require target domain data to form its representations. Our approach utilises a three stage pipeline: 1) learning to see, 2) learning to act, 3) transfer.

Training and deployment domains (source domain and target domain)

This transfer setting has two defining properties:
(1) the source (training) and target (test) data distributions differ substantially;
(2) once training on the source domain is finished, no further learning takes place in the target domain.

The source-domain and target-domain data are related as follows (a toy sketch is given below):
(1) the action space is shared;
(2) the transition and reward functions are similar;
(3) the state (observation) spaces differ considerably.
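To make this setup concrete, here is a hypothetical toy example of my own (not from the paper, whose experiments use DeepMind Lab and a simulated/real Jaco arm): two environments share the action space, dynamics and reward, and differ only in how the underlying state is rendered into observations.

```python
import numpy as np

class LineWorld:
    """1-D toy world: the agent moves left/right and is rewarded at position 0.
    Source and target share actions, transitions and reward; they differ only
    in the observation function (a stand-in for e.g. simulated vs. real images)."""

    def __init__(self, render_fn, size=10):
        self.size = size
        self.render_fn = render_fn          # the only thing that changes per domain
        self.pos = size // 2

    def reset(self):
        self.pos = self.size // 2
        return self.render_fn(self.pos, self.size)

    def step(self, action):                 # shared action space: 0 = left, 1 = right
        self.pos = int(np.clip(self.pos + (1 if action == 1 else -1), 0, self.size - 1))
        reward = 1.0 if self.pos == 0 else 0.0   # shared reward function
        return self.render_fn(self.pos, self.size), reward

def render_source(pos, size):               # source domain: one-hot "pixels"
    obs = np.zeros(size)
    obs[pos] = 1.0
    return obs

def render_target(pos, size):               # target domain: same state, very different rendering
    obs = np.full(size, 0.5)
    obs[pos] = 0.0
    return obs

source_env = LineWorld(render_source)
target_env = LineWorld(render_target)       # a policy trained on source_env is evaluated here, frozen
```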

Algorithm details

The algorithm first maps the high-dimensional observations $S^o_i$ to low-dimensional latent states $S^z_i$, and it does so with unsupervised learning.

  1. In the process of doing so, the agent implicitly learns a function $F: S^o_i \to S^z_i$ that maps the typically high-dimensional raw observations $S^o_i$ to typically low-dimensional latent states $S^z_i$; followed by a policy function $\pi_i: S^z_i \to A_i$ that maps the latent states $S^z_i$ to actions $a_i$.
  2. Such a source policy $\pi_S$ is likely to be based on an entangled latent state space $S^z_S$.
  3. Hence, DARLA is based on the idea that a good quality $F$ learnt exclusively on the source domain $D_S \in \mathcal{M}$ will zero-shot generalise to all target domains $D_i \in \mathcal{M}$, and therefore the source policy $\pi(a|s^z_S;\theta)$ will also generalise to all target domains $D_i \in \mathcal{M}$ out of the box.

The algorithm has three parts (a rough outline in code is given below):
(1) learning the feature representation, which is the key part of the paper and uses unsupervised learning;
(2) feeding that representation into a standard RL algorithm (DQN, DDPG, A3C);
(3) transferring from the source domain to the target domain.
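As a rough outline of the three-stage pipeline (my own sketch; the function names and bodies are placeholders, not the authors' code):

```python
# Sketch of DARLA's three-stage pipeline. The function names are hypothetical
# placeholders; the real components are a beta-VAE, a standard RL algorithm
# (DQN / DDPG / A3C), and a zero-shot evaluation in the target domain.

def learn_to_see(source_observations):
    """Stage 1: fit an unsupervised beta-VAE on source-domain frames only and
    return the frozen encoder F: raw observation -> disentangled latent s_z."""
    ...

def learn_to_act(encoder, source_env):
    """Stage 2: train a policy pi(a | s_z) with an off-the-shelf RL algorithm,
    feeding it encoder(observation) instead of raw pixels; the encoder stays frozen."""
    ...

def transfer(encoder, policy, target_env):
    """Stage 3: run encoder + policy in the target domain with no fine-tuning
    and measure the zero-shot performance."""
    ...
```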

Stage (1) is therefore the crux of this paper, so let us look at how it is implemented.

$F_U$, the feature-representation network, uses β-VAE, an algorithm that automatically extracts a feature representation from raw images in an unsupervised way.

DARLA utilises β-VAE, a state-of-the-art unsupervised model for automated discovery of factorised latent representations from raw image data.

First, the loss function is defined:
$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)$$

Here $\phi$ and $\theta$ parameterise the encoder $q_\phi(z|x)$ and the decoder $p_\theta(x|z)$ respectively, $\beta > 1$ is a hyperparameter, and $x$ and $z$ denote the raw input and its latent code. The objective above is maximised (equivalently, its negative is minimised as the training loss). In DARLA the pixel-space reconstruction term is replaced by a perceptual one: the reconstruction $\hat{x} \sim p_\theta(x|z)$ and the input $x$ are both passed through a pretrained denoising autoencoder and compared in its feature space. Once this objective is clear, the rest of the paper is straightforward.
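As a minimal sketch (my own PyTorch code, not the authors'; it scores the reconstruction in pixel space with a Bernoulli likelihood instead of through the pretrained denoising autoencoder), the per-batch loss could be computed like this:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """beta-VAE objective to minimise: reconstruction term + beta * KL term.

    x          : input batch, values in [0, 1]
    x_hat      : decoder output (reconstruction of x, sigmoid-activated)
    mu, logvar : parameters of the diagonal-Gaussian encoder q_phi(z|x)
    beta       : > 1 encourages disentangled latents (4.0 is an illustrative default)
    """
    # Reconstruction term: negative Bernoulli log-likelihood in pixel space.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum") / x.size(0)
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```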

After that, the latent code $z$ is simply fed into the RL algorithm.
Next, let us illustrate in code how the β-VAE is trained.
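Below is a minimal training-loop sketch under the same assumptions as above: a tiny fully-connected encoder/decoder on flattened 64x64 frames stands in for the paper's convolutional architecture, and `source_frames` is a hypothetical batch of source-domain observations. It reuses the `beta_vae_loss` function from the previous snippet.

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Tiny fully-connected beta-VAE for flattened 64x64 grayscale frames
    (a stand-in for the convolutional architecture used in the paper)."""
    def __init__(self, obs_dim=64 * 64, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, z_dim), nn.Linear(512, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, obs_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.dec(z), mu, logvar

model = BetaVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# source_frames: tensor of source-domain observations, shape (N, 64*64); random data here.
source_frames = torch.rand(1024, 64 * 64)
for epoch in range(10):
    for i in range(0, len(source_frames), 64):
        x = source_frames[i:i + 64]
        x_hat, mu, logvar = model(x)
        loss = beta_vae_loss(x, x_hat, mu, logvar, beta=4.0)   # defined in the previous snippet
        opt.zero_grad(); loss.backward(); opt.step()

# After stage 1, the frozen encoder mean is used as the latent state s_z for the RL stage:
with torch.no_grad():
    s_z = model.mu(model.enc(source_frames[:1]))   # this s_z is what the policy sees
```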
