Reproducing the PPO Reinforcement Learning Algorithm with Baidu PaddlePaddle and PARL

Latest recommended article published on 2024-09-21 22:06:36

AItrust

Category: Reinforcement Learning
Tags: machine learning, reinforcement learning, algorithms, artificial intelligence

Article link: https://blog.csdn.net/qq_42067550/article/details/107528958

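Contents

I. PPO Training Results (MuJoCo HalfCheetah-v2)
II. A Review of Policy Optimization Algorithms
III. Reading the PPO Paper
IV. Implementing PPO with Baidu PaddlePaddle and PARL
V. Recommended Introductory Reinforcement Learning Courses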

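I. PPO Training Results (MuJoCo HalfCheetah-v2)

The PPO reinforcement learning algorithm teaches the cheetah to run!

After 300,000 steps the score reaches 1000
After 600,000 steps the score reaches 1500
After 2,000,000 steps the score exceeds 2500
The final score can reach 4000-5000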

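II. A Review of Policy Optimization Algorithms

PG (policy gradient)
- Sample efficiency is low, because the same policy is used for both sampling and optimization (on-policy)
- Training is unstable and prone to collapse (possibly because the samples are highly correlated). Errors get amplified: if a bad policy is sampled by chance, its interaction with the environment produces even worse data, which makes it hard to recover from the mistake and leads to collapse
How can training be made more stable?
- Introduce a trust region mechanism so that each gradient update stays within a safe range (that is, the policy changes only gradually)
- Or use the natural policy gradient, a second-order optimization method (SGD relies on a first-order approximation, so the resulting policy update is less accurate)
How can it be made off-policy? (data collection and optimization use different policies)
- Use importance sampling (as in TRPO, for example)
Further improving TRPO: ACKTR
- Improves TRPO's computational efficiency by replacing the inversion of the Fisher information matrix with Kronecker-factored approximate curvature (K-FAC). The result is much faster training
Simplifying TRPO: Proximal Policy Optimization (PPO)
- Combines the natural gradient and TRPO's loss function in an unconstrained form
- Performs similarly to TRPO, but is faster because it is optimized with first-order SGD
PPO with clipping
- Clips the probability ratio so that the policy output cannot change too aggressively, which stabilizes the updates
- Most PPO implementations use this form
- A classic PG algorithm can be rewritten as PPO with only a few lines of code, so it is easy to implement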
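
To make the "few lines of code" point concrete, here is a minimal NumPy sketch that places a vanilla PG loss next to a clipped PPO loss. The function names `pg_loss` and `ppo_clip_loss`, and the default `epsilon=0.2`, are illustrative assumptions and are not taken from the PARL implementation; a real implementation would compute these quantities inside a differentiable framework.

```python
import numpy as np


def pg_loss(logp_new, advantages):
    """Vanilla policy-gradient loss: -mean(log pi(a|s) * A)."""
    return -np.mean(logp_new * advantages)


def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped PPO loss: -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A))."""
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Taking the elementwise minimum gives a pessimistic bound on the objective.
    surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)
    return -np.mean(surrogate)


# Toy batch (made-up numbers): the new policy has moved far from the old one.
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.1]))
advantages = np.array([1.0, -1.0])
loss = ppo_clip_loss(logp_new, logp_old, advantages)
```

In the toy batch both ratios (1.8 and 0.2) fall outside [0.8, 1.2], so both samples contribute their clipped values and no longer reward moving the policy further in the same direction.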

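III. Reading the PPO Paper

Paper:

Proximal Policy Optimization Algorithms

1. Introduction

The introduction compares the strengths and weaknesses of existing algorithms and presents the advantages of PPO: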

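Q-learning (with function approximation)
- Fails on many simple problems
- Is poorly understood
vanilla policy gradient
- Poor data efficiency
- Poor robustness
TRPO
- Relatively complicated
- Not compatible with architectures that include noise (such as dropout) or parameter sharing
PPO
- High data efficiency
- Reliable across a wide range of settings, similar to TRPO
- Simple and easy to understand
- Introduces a novel clipped probability ratio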

2. Background: Policy Optimization

2.1 Policy Gradient Methods

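Reviews the core of the PG algorithm
Also explains where the instability comes from: performing multiple optimization steps on the same trajectory often leads to destructively large policy updates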
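
For reference, the gradient estimator and the loss it is derived from can be written as follows. The notation follows the paper's conventions: $\hat{\mathbb{E}}_t$ is an empirical average over a batch of timesteps and $\hat{A}_t$ is an advantage estimate.

```latex
\hat{g} = \hat{\mathbb{E}}_t\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right],
\qquad
L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```

Differentiating $L^{PG}$ with respect to $\theta$ yields $\hat{g}$, which is why repeated optimization steps on $L^{PG}$ with the same batch push the policy further and further from the one that collected the data.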

2.2 Trust Region Methods

Reviews the core of the TRPO algorithm
Explains the surrogate objective
Notes that in the unconstrained (penalty) form, a fixed penalty coefficient does not work well
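
As a reminder of what these points refer to, the constrained TRPO problem and its penalized, unconstrained counterpart can be written as below, where $\delta$ is the trust region size and $\beta$ the penalty coefficient:

```latex
\max_\theta \; \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t \right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\left[ \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right] \le \delta

\max_\theta \; \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t
- \beta\, \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right]
```

The difficulty with the second form is that no single value of $\beta$ works well across different problems, or even throughout training on one problem.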

3. Clipped Surrogate Objective

Explains how PPO achieves an effect similar to TRPO by clipping the probability ratio
while keeping the algorithm simple and easy to understand
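
The clipped surrogate objective from this section of the paper, written out with the probability ratio $r_t(\theta)$ and the clip range $\epsilon$ (a hyperparameter; the paper reports $\epsilon = 0.2$ as a good setting):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},
\qquad
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\, \hat{A}_t,\;
\mathrm{clip}\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]
```

The $\min$ makes the objective a lower bound on the unclipped one: a change in the ratio is ignored only when it would improve the objective, and is always counted when it makes the objective worse.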

4. Adaptive KL Penalty Coefficient

Uses a KL divergence penalty as an alternative to clipping
The penalty coefficient has to be adapted accordingly
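
The adaptation rule in the paper is short: after each policy update, compare the measured KL divergence with a target value, then halve or double the coefficient. A minimal sketch follows; the function and argument names are illustrative assumptions, while the factors 1.5 and 2 are the heuristic values the paper uses.

```python
def update_kl_coefficient(beta, kl, kl_target):
    """Return the penalty coefficient to use for the next policy update."""
    if kl < kl_target / 1.5:
        # The policy moved less than intended: weaken the penalty.
        return beta / 2.0
    if kl > kl_target * 1.5:
        # The policy moved more than intended: strengthen the penalty.
        return beta * 2.0
    # The KL divergence is close enough to the target: keep the coefficient.
    return beta


beta = 1.0
beta = update_kl_coefficient(beta, kl=0.1, kl_target=0.01)  # KL too large, so beta doubles
```

Because the coefficient adjusts quickly, the paper notes that its initial value matters little in practice.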

5. Algorithm

Explains the concrete implementation of the algorithm
Covers the policy function and the value function
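
The value function matters here because the advantage estimates are built from it. Below is a minimal NumPy sketch of generalized advantage estimation (GAE), the estimator the paper uses. The helper names and the default `gamma` and `lam` values are assumptions made for illustration; this is not the code behind PARL's `calc_gae`.

```python
import numpy as np


def discounted_cumsum(x, discount):
    """out[t] = x[t] + discount * x[t+1] + discount^2 * x[t+2] + ..."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out


def gae_advantages(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    rewards[t] and values[t] belong to step t; next_value is the value
    estimate of the state after the last step (0.0 if the episode ended).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    values_next = np.append(values[1:], next_value)
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values_next - values
    # The advantage is the (gamma * lam)-discounted sum of future residuals.
    return discounted_cumsum(deltas, gamma * lam)


adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], next_value=0.0,
                     gamma=1.0, lam=1.0)
```

With `gamma = lam = 1` and a zero value function, the advantages reduce to the undiscounted rewards-to-go, which is a handy sanity check.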

6. Experiments

6.1 Comparison of Surrogate Objective

First compares the surrogate objective with no clipping or penalty, with clipping, and with a KL penalty
The latter two perform better
Clipping performs better than the KL penalty

6.2 Comparison to Other Algorithms in the Continuous Domain

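PPO converges faster than the other algorithms (within the first 1,000,000 timesteps)
It applies to a wider range of tasks
It achieves better results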

6.3 Showcase in the Continuous Domain: Humanoid Running and Steering

In a complex continuous control environment, PPO performs very well

6.4 Comparison to Other Algorithms on the Atari Domain

Tests on Atari games show that PPO learns faster while maintaining comparable performance

7. Conclusion

Summarizes the advantages of PPO: stable, reliable, and simple

IV. Implementing PPO with Baidu PaddlePaddle and PARL

1. Dependencies

python3.5+
paddlepaddle>=1.6.1
parl
gym
tqdm
mujoco-py>=1.50.1.0

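1.1 Introduction to gym and parl

See:

Getting Started with Reinforcement Learning (1): Basic Concepts of Reinforcement Learning and an Introduction to the Gym and PARL Libraries

1.2 Installing MuJoCo and mujoco-py

See:

Reinforcement Learning Environments: Notes on MuJoCo Installation Pitfalls (July 18, 2020)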

2. Main Program

#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import gym
import numpy as np
import parl
from mujoco_agent import MujocoAgent
from mujoco_model import MujocoModel
from parl.utils import logger, action_mapping
from parl.utils.rl_utils import calc_gae, calc_discount_sum_rewards
from scaler import Scaler


def run_train_episode(env, agent, scaler):
    obs = env.reset()
    observes, actions, rewards, unscaled_obs = [], [], [], []
    step = 0.0
    scale, offset = scaler.get()
    scale[-1] = 1.0  # don't scale time step feature
    offset[-1] = 0.0  # don't offset time step feature
    while True:
        obs = obs.reshape((1, -1))
        obs = np.append(obs, [[step]], axis=1)  # add time step feature
        unscaled_obs.append(obs)
        obs = (obs - offset) * scale  # center and scale observations
        obs = obs.astype('float32')
        observes.append(obs)

        action = agent.policy_sample(obs)
        action = np.clip(action, -1.0, 1.0)
        action = action_mapping(action, env.action_space.low[0],
                                env.action_space.high[0])

        action = action.reshape((1, -1)).astype('float32')
        actions.append(action)

        obs, reward, done, _ = env.step(np.squeeze(action))
        rewards.append(reward)
        step += 1e-3  # increment time step feature

        if done:
            break

    return (np.concatenate(observes), np.concatenate(actions),
            np.array(rewards, dtype='float32'), np.concatenate(unscaled_obs))


# Evaluate the policy (deterministic prediction, no exploration noise)
def run_evaluate_episode(env, agent, scaler):
    obs = env.reset()
    rewards = []
    step = 0.0
    scale, offset = scaler.get()
    scale[-1] = 1.0  # don't scale time step feature
    offset[-1] = 0.0  # don't offset time step feature
    while True:
        obs = obs.reshape((1, -1))
        obs = np.append(obs, [[step]], axis=1)  # add time step feature
        obs = (obs - offset) * scale  # center and scale observations
        obs = obs.astype('float32')

        action = agent.policy_predict(obs)
        action = action_mapping(action, env.action_space.low[0],
                                env.action_space.high[0])

        obs, reward, done, _ = env.step(np.squeeze(action))
        rewards.append(reward)

        step += 1e-3  # increment time step feature

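        if done:
            break

    # Assumed completion: the listing is cut off after `if done:`.
    # Returning the total episode reward is a guess at the missing lines.
    return np.sum(rewards)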
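
Both episode functions above rescale the policy output to the environment's action bounds with `action_mapping`, passing in `env.action_space.low[0]` and `env.action_space.high[0]`. The sketch below shows the kind of linear rescaling from [-1, 1] to [low, high] that such a step performs. `map_action` is a hypothetical stand-in written for illustration, not PARL's source code.

```python
import numpy as np


def map_action(model_output, low, high):
    """Hypothetical helper: linearly map values in [-1, 1] to [low, high]."""
    assert high > low, "the upper bound must exceed the lower bound"
    # Guard against outputs that fall slightly outside [-1, 1].
    model_output = np.clip(model_output, -1.0, 1.0)
    # -1 maps to low, +1 maps to high, 0 maps to the midpoint.
    return low + (model_output + 1.0) * 0.5 * (high - low)


mapped = map_action(np.array([-1.0, 0.0, 1.0]), low=-2.0, high=2.0)
```

Keeping the policy output in a fixed [-1, 1] range and rescaling at the environment boundary lets the same network architecture be reused across environments with different action limits.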