Reinforcement Learning | Mirror Learning

一辄

Last modified 2023-10-05 10:03:14

Category: Reinforcement Learning. Tags: reinforcement learning, Mirror Learning, descent

First published 2022-03-24 17:27:49

Article link: https://blog.csdn.net/qq_45832958/article/details/123698273

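Study report:

🌱 I read the paper Mirror Learning: A Unifying Framework of Policy Optimisation. This blog post summarises its core content; for lack of time, the detailed derivations are not written up.

🌱 Because of some of the properties involved, I looked up a great deal of convex optimisation material; I will write it up when I have time.

🌱 Previous article 1: Introductory RL Notes | UCL Silver RL | UC Berkeley CS285 DRL

🌱 Previous article 2: Reinforcement Learning | Policy Gradient | Natural PG | TRPO | PPO

🌱 Previous article 3: Reinforcement Learning | Multi Agents | Trust Region | HATRPO | HAPPO

🌱 This article (4): Reinforcement Learning | Mirror Learning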

Contents:

🌻 Paper overview

🌻 Understanding the paper

🌴 Mirror Learning framework (template)

🌴 GPI, TRPO, PPO (instances)

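Paper overview

Currently, mainstream DRL algorithms are mostly based on one of two rigorous theoretical designs: GPI (Generalized Policy Iteration) or TRL (Trust Region Learning).

However, algorithms that strictly follow these theoretical frameworks have proven unscalable.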

In fact, during the development of clip-PPO, a variant more closely related to TRL was considered. It maximises $\mathbb{E}_{\mathrm{s} \sim \rho_{\pi_{\mathrm{old}}}, \mathrm{a} \sim \bar{\pi}} \left[A_{\pi_{\mathrm{old}}}(\mathrm{s}, \mathrm{a})\right]-\tau \overline{\mathrm{D}}_{\mathrm{KL}}\left(\pi_{\mathrm{old}}, \bar{\pi}\right)$ and then scales $\tau$ up or down by a constant, depending on whether the KL divergence induced by the update exceeded or fell below some target level. Intriguingly, although more closely related to TRL, this variant failed in practice. (I think this refers to PPO1, the one with the adaptive penalty.)
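The "scale $\tau$ up or down by a constant" rule can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `update_kl_coefficient` and the sample KL values are made up, and the constants 1.5 and 2 are the heuristic values reported for the adaptive KL penalty in the PPO paper.

```python
def update_kl_coefficient(tau, observed_kl, target_kl, tolerance=1.5, factor=2.0):
    """Scale the KL penalty coefficient tau by a constant factor.

    Hypothetical helper: tolerance=1.5 and factor=2.0 follow the heuristic
    adaptive-penalty rule described in the PPO paper.
    """
    if observed_kl > target_kl * tolerance:
        return tau * factor  # the update moved too far: penalise KL more
    if observed_kl < target_kl / tolerance:
        return tau / factor  # the update was too timid: penalise KL less
    return tau               # KL is close to the target: keep tau unchanged


tau = 1.0
for kl in [0.05, 0.004, 0.01]:  # made-up KL values measured after three updates
    tau = update_kl_coefficient(tau, kl, target_kl=0.01)
    print(tau)
```

Note that the coefficient only reacts after an update has already been made, so a single step can still overshoot the target KL level.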

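Therefore, to strike a balance between theory and practice, many DRL algorithms approximate the theory and follow it only by "analogy". However, this approximation process actually violates some of the GPI / TRL assumptions.

For example, TRPO turns the KL penalty derived in TRL into a hard constraint, while PPO uses a clip function to approximate this constraint on the policy update.

TRPO converts the penalty term into a constraint because, in practice, the hyperparameter in front of the penalty is hard to tune. PPO drops the KL divergence and instead uses clipping to limit the step between the old and new policies; this simplifies the optimisation and is likewise motivated by practical computational efficiency.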
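To make the clipping idea concrete, here is a minimal NumPy sketch of the standard PPO clipped surrogate, $\mathbb{E}[\min(r A, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon) A)]$, where $r$ is the probability ratio between the new and old policies. The function name `ppo_clip_objective` and the sample numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sample
    advantage: advantage estimate A_pi_old(s, a) for each sample
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))


# With a positive advantage, pushing the ratio above 1 + eps earns nothing extra.
print(ppo_clip_objective([1.5], [1.0]))
# With a negative advantage, the unclipped (more pessimistic) term is kept.
print(ppo_clip_objective([1.5], [-1.0]))
```

The clipping removes the incentive to move the ratio far from 1, which plays the role of the trust-region constraint without computing a KL divergence.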

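Given that algorithms such as TRPO and PPO depart from the rigorously derived theory because of these "approximations", why do they still perform so well in practice, and how can this be explained?

In fact, these empirically successful algorithms do have theoretical support: they can be explained by the Mirror Learning framework proposed in this paper. Moreover, the authors prove that Mirror Learning has the monotonic improvement property and converges to the optimal return.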

On why the framework is named Mirror Learning:

Mirror Learning is not a method of solving regularised problems through mirror descent. Instead, it is a very general class of algorithms that solve the classical MDP.

The term mirror, however, is inspired by the intuition behind mirror descent, which solves the image of the original problem under the mirror map; similarly, we define the mirror operator (drift functional $\mathfrak{D}$).

A higher-level view of Mirror Learning

Mirror Learning provides RL algorithm designers with a template. New instances of it can be obtained by altering the drift functional $\mathfrak{D}^{\nu}$, the neighbourhood operator $\mathcal{N}$, and the sampling distribution function $\beta_{\pi}$. Algorithms such as TRPO and PPO are instances of this Mirror Learning template.
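For reference, the update that this template describes can be sketched in the paper's notation as below. This is a transcription of the paper's definitions as I understand them, so check the paper for the exact conditions on each object; $\nu_{\pi}^{\bar{\pi}}$ denotes the state distribution used to weight the drift.

$$
\left[\mathcal{M}_{\mathfrak{D}}^{\bar{\pi}} V_{\pi}\right](s) = \mathbb{E}_{\mathrm{a} \sim \bar{\pi}}\left[Q_{\pi}(s, \mathrm{a})\right] - \frac{\nu_{\pi}^{\bar{\pi}}(s)}{\beta_{\pi}(s)} \mathfrak{D}_{\pi}(\bar{\pi} \mid s)
$$

$$
\pi_{\mathrm{new}} = \underset{\bar{\pi} \in \mathcal{N}(\pi_{\mathrm{old}})}{\arg\max}\; \mathbb{E}_{\mathrm{s} \sim \beta_{\pi_{\mathrm{old}}}}\left[\left[\mathcal{M}_{\mathfrak{D}}^{\bar{\pi}} V_{\pi_{\mathrm{old}}}\right](\mathrm{s})\right]
$$

Taking the expectation over $\mathrm{s} \sim \beta_{\pi}$ turns the second term into the drift $\mathfrak{D}_{\pi}^{\nu}(\bar{\pi}) = \mathbb{E}_{\mathrm{s} \sim \nu_{\pi}^{\bar{\pi}}}\left[\mathfrak{D}_{\pi}(\bar{\pi} \mid \mathrm{s})\right]$, so the update trades expected advantage against the drift within the neighbourhood $\mathcal{N}(\pi_{\mathrm{old}})$.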

A brief explanation of $\mathfrak{D}^{\nu}$, $\mathcal{N}$, and $\beta_{\pi}$ (the precise definitions are in the detailed part of the write-up):

$\mathfrak{D}^{\nu}$ is a drift functional, an abstract definition that can be instantiated as a specific constraint chosen by the RL algorithm designer. Practitioners can choose $\mathfrak{D}$ to describe a cost that they want to limit throughout training, for example $\mathfrak{D}$ = risk or $\mathfrak{D}$ = memory. In the experiments at the end of the paper, the authors use KL divergence, squared L2 distance, and squared total variation distance as drift functionals.

$\mathcal{N}$ is the neighbourhood operator. The trivial neighbourhood operator is $\mathcal{N} \equiv \Pi$, that is, the whole policy space.

$\beta_{\pi}$ is a state distribution, referred to as the sampling distribution.
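The three drift choices mentioned above can be sketched for discrete action distributions at a single state. This is an illustration under my own assumptions: the function names are hypothetical, the KL direction $\mathrm{KL}(\pi \,\|\, \bar{\pi})$ is chosen to match the $\overline{\mathrm{D}}_{\mathrm{KL}}(\pi_{\mathrm{old}}, \bar{\pi})$ form used earlier, and the probabilities are made up. The loop checks the two basic properties a drift should have: it is zero when $\bar{\pi} = \pi$ and non-negative otherwise.

```python
import numpy as np


def kl_drift(pi, pi_bar):
    """KL(pi || pi_bar) for strictly positive discrete distributions."""
    return float(np.sum(pi * np.log(pi / pi_bar)))


def squared_l2_drift(pi, pi_bar):
    """Squared Euclidean distance between the two probability vectors."""
    return float(np.sum((pi - pi_bar) ** 2))


def squared_tv_drift(pi, pi_bar):
    """Squared total variation distance, with TV = 0.5 * sum |pi - pi_bar|."""
    return float((0.5 * np.sum(np.abs(pi - pi_bar))) ** 2)


pi = np.array([0.5, 0.3, 0.2])      # made-up old policy at one state
pi_bar = np.array([0.4, 0.4, 0.2])  # made-up candidate policy at the same state

for drift in (kl_drift, squared_l2_drift, squared_tv_drift):
    # Each drift vanishes at pi_bar == pi and is positive away from it.
    print(drift.__name__, drift(pi, pi), drift(pi, pi_bar))
```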

The detailed derivation of Mirror Learning is in the detailed part of the write-up.
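As a toy illustration of the monotonic improvement claim, the sketch below runs a mirror-learning-style update on a single-state problem (a bandit), where $Q(a)$ is simply the reward. The setup is my own assumption, not an experiment from the paper: the neighbourhood is trivial, the drift is $\tau \cdot \mathrm{KL}(\bar{\pi} \,\|\, \pi)$ (the direction that admits a closed-form maximiser, $\bar{\pi}(a) \propto \pi(a) \exp(Q(a)/\tau)$), and the rewards are made up.

```python
import numpy as np


def mirror_step_kl(pi, q_values, tau=1.0):
    """Closed-form argmax over pi_bar of sum_a pi_bar(a) q(a) - tau * KL(pi_bar || pi)."""
    logits = np.log(pi) + q_values / tau
    logits -= logits.max()  # subtract the max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()


rewards = np.array([1.0, 0.5, 0.0])  # made-up bandit rewards, so Q(a) = r(a)
pi = np.full(3, 1.0 / 3.0)           # start from the uniform policy

returns = []
for _ in range(50):
    returns.append(float(pi @ rewards))  # expected return of the current policy
    pi = mirror_step_kl(pi, rewards, tau=1.0)
returns.append(float(pi @ rewards))

# The expected return never decreases and approaches the best reward.
print(returns[0], returns[-1])
```

A larger `tau` makes each step more conservative, which slows the approach to the optimal return but does not break the monotonic improvement in this toy setting.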

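Understanding the paper

The paper's main contributions were described in the previous part. For lack of time, they are not written up in detail here.

I read the paper closely. To get the PDF with my personal notes, please like, favourite, and share this post and leave your email address.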

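The authors organise the paper as follows:

The Introduction explains why the strict definitions of GPI and TRL are inconvenient in practical applications and reviews existing work that relies on approximations. It then presents the paper's main contribution, Mirror Learning.

Part 2, Background, first introduces common DRL notation and then describes GPI and TRL.

Part 3 derives the Mirror Learning framework. Part 4 presents GPI, TRPO, and PPO as instances of Mirror Learning.

Part 5 is Related Work. Part 6 makes a connection to graphs: "use mirror learning to make a surprising connection between RL and graph theory". Finally, the Mirror Learning framework is validated in simple experimental environments.

Below are the paper's core Parts 3 and 4, which respectively introduce Mirror Learning and view TRPO and the other instances from the unified Mirror Learning perspective.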