Evolution-based RL algorithm

WuYuFffan

Article link: https://blog.csdn.net/WuYuFffan/article/details/108694177


1. What is GA.

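A genetic algorithm (GA) is a population-based algorithm for solving optimization problems. GAs were popularized by John Holland in the 1970s; the closely related differential evolution method was proposed by Rainer Storn and Kenneth Price in 1997 [3].
A GA imitates genetic variation in nature: a population of individuals is randomly initialized, and in each generation the steps of mutation, crossover, and selection produce the next population.
After thousands of generations (iterations), the population typically converges toward a good solution, ideally the global optimum, although reaching it is not guaranteed.

(Figure: a GA searching for the optimum in a two-dimensional space.)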
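The three GA steps can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the toy objective `fitness`, the population size, the mutation scale, and the choice to keep the fitter half as parents are all assumptions made for the example.

```python
import numpy as np

def fitness(x):
    # Toy 2-D objective (an assumption for this sketch): maximum 0 at (1, -2).
    return -np.sum((x - np.array([1.0, -2.0])) ** 2, axis=-1)

def genetic_algorithm(n_pop=40, n_gen=100, mut_sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(n_pop, 2))   # random initial population
    n_keep = n_pop // 2
    for _ in range(n_gen):
        fit = fitness(pop)
        # Selection: keep the fitter half as parents (sorted worst to best).
        parents = pop[np.argsort(fit)[-n_keep:]]
        # Crossover: each child copies each coordinate from one of two random parents.
        ia = rng.integers(0, n_keep, size=n_pop)
        ib = rng.integers(0, n_keep, size=n_pop)
        mask = rng.random((n_pop, 2)) < 0.5
        children = np.where(mask, parents[ia], parents[ib])
        # Mutation: add small Gaussian noise to every child.
        children += mut_sigma * rng.standard_normal(children.shape)
        # Elitism: carry the best parent over unchanged so the best fitness never drops.
        children[0] = parents[-1]
        pop = children
    fit = fitness(pop)
    return pop[np.argmax(fit)], fit.max()

best_x, best_fit = genetic_algorithm()
print(best_x, best_fit)
```

With these settings the best individual should end up close to (1, -2); other selection or crossover schemes are equally valid GA variants.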

2. What is ES, and the difference between ES and GA.

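Evolution strategies (ES) are the method that the authors of [1] propose as an alternative to the policy gradient (PG) methods used in conventional reinforcement learning.
The core idea of ES is similar to that of GA, with a few differences.
(Figure: pseudocode of ES.)
GA is an algorithm for solving general optimization problems, whereas ES borrows the evolutionary process of GA to update the parameters of a neural network and thereby steer reinforcement learning.
The GA procedure includes the full mutation, crossover, and selection steps, while ES simply adds noise to the parameters.
(Figure: pseudocode of GA. The image is reproduced from another source; if it infringes any rights, please contact the author.)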
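Since the pseudocode figure is not reproduced here, the ES update can also be written as a formula. The notation below is chosen for this note rather than taken from the post: $\theta_t$ is the flat parameter vector, $F$ the episode reward, $n$ the population size, $\sigma$ the noise standard deviation, and $\alpha$ the learning rate. It corresponds to the update line `w = w + alpha/(npop*sigma) * np.dot(N.T, A)` in the simple Python implementation in section 3, where normalized rewards take the place of $F_i$.

```latex
\begin{aligned}
\epsilon_i &\sim \mathcal{N}(0, I), \qquad i = 1, \dots, n \\
F_i &= F(\theta_t + \sigma \epsilon_i) \\
\theta_{t+1} &= \theta_t + \alpha \, \frac{1}{n \sigma} \sum_{i=1}^{n} F_i \, \epsilon_i
\end{aligned}
```

The sum is a Monte Carlo estimate of the gradient of the smoothed objective $\mathbb{E}_{\epsilon}\left[F(\theta + \sigma\epsilon)\right]$, which equals $\frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\left[F(\theta + \sigma\epsilon)\,\epsilon\right]$, so no backpropagation through the network is needed.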

3. The difference between ES and PG, and the benefits of ES over PG.

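PG:
- Forward pass: the state is propagated through the network, noise is added in the action space, and the value of the resulting action is computed.
- Backward pass: the gradient of that value is computed and the error is backpropagated through the network.
ES:
- Following the idea of GA, the network parameters are perturbed directly; no gradient is computed, and the value of each action is obtained directly from the perturbed network.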
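The difference in where the noise enters can be sketched as follows. The linear policy `W`, its 4-feature and 2-action sizes, and the softmax sampling are assumptions made only for illustration; they are not the networks used later in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2)) * 0.1   # hypothetical linear policy: 4 features -> 2 action scores
state = rng.standard_normal(4)

def pg_style_action(W, state, rng):
    # PG: the parameters stay fixed; randomness enters when the action is sampled.
    logits = state @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

def es_style_action(W, state, rng, sigma=0.1):
    # ES: the parameters are perturbed, then the perturbed policy acts deterministically.
    W_try = W + sigma * rng.standard_normal(W.shape)
    return int(np.argmax(state @ W_try)), W_try

a_pg = pg_style_action(W, state, rng)
a_es, W_try = es_style_action(W, state, rng)
print(a_pg, a_es)
```

In the PG case the learning signal has to be pushed back through the network with gradients; in the ES case the reward of each `W_try` is simply attributed to the noise that produced it.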
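The advantages of ES are listed in detail in [1]; the key ones are:
- Compared with gradient-based methods (GD or SGD), ES explores more strongly and, especially late in training, is less likely to get stuck in a local optimum.
- ES is inherently parallel, which makes it well suited to acceleration on large clusters of machines, or on GPUs and FPGAs.
- Compared with PG-based RL algorithms, ES reaches comparable results in much less running time, and on some problems with local optima it even outperforms them.
ES is therefore a promising replacement for PG when training reinforcement learning agents.
A simple implementation of ES in Python follows: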

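import numpy as np

solution = np.array([0.5, 0.1, -0.3])   # target vector the search should recover

def f(w):
    # reward: negative squared distance to the target (maximum 0 at w == solution)
    return -np.sum((w - solution)**2)

npop = 50      # population size
sigma = 0.1    # standard deviation of the noise
alpha = 0.001  # learning rate
w = np.random.randn(3)  # initial guess for solution
for i in range(300):
    N = np.random.randn(npop, 3)  # sample the noise
    R = np.zeros(npop)
    for j in range(npop):
        w_try = w + sigma*N[j]
        R[j] = f(w_try)             # reward returned by the environment
    A = (R - np.mean(R)) / np.std(R)  # normalized rewards
    w = w + alpha/(npop*sigma) * np.dot(N.T, A)   # update the parameters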

4. A simple introduction to gym test suite.

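CartPole-v0: a pole is attached to a cart, and the agent pushes the cart left or right to keep the pole upright.
MountainCar-v0: an underpowered car has to build up momentum to reach the top of a hill.
Pendulum-v0: the agent applies a continuous torque to swing a pendulum up and hold it upright.

The detailed rules of each game are linked from the original post.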

The following code lets the computer play these games with random actions and no training:

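import gym
# env = gym.make('CartPole-v0')
env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
env.reset()
for _ in range(1000):
    env.render()
    # take a random action; step() returns (observation, reward, done, info) in the gym version used here
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()  # start a new episode once the current one has ended
env.close()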

When the computer plays these three games with a random policy, it reaches the goal of the game only with very small probability.

5. Using an ES-based RL algorithm to solve some simple problems from the gym test suite.

In this section we reproduce the algorithm of [1] to solve the three problems above in a simple way.
The Python skeleton of the algorithm is:

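N_KID = 10          # kids (perturbed networks) per generation; placeholder value
N_GENERATION = 100  # number of generations; placeholder value

def get_reward():
    # compute the fitness (episode reward) of one kid
    return 0.0      # placeholder

def build_net():
    # initialize the neural network
    pass            # placeholder

def train():
    # train the network in parallel; shown here as a plain loop
    rewards = [get_reward() for i in range(N_KID)]

# main loop
build_net()
for g in range(N_GENERATION):
    train()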

To keep the structure of the algorithm clear, we build the neural network directly with numpy.

def build_net():
    def linear(n_in, n_out):  # one linear layer, stored as a flat vector
        w = np.random.randn(n_in * n_out).astype(np.float32) * .1   # weights
        b = np.random.randn(n_out).astype(np.float32) * .1          # biases
        return (n_in, n_out), np.concatenate((w, b))
    # three layers: n_feature -> 30 -> 20 -> n_action
    s0, p0 = linear(CONFIG['n_feature'], 30)
    s1, p1 = linear(30, 20)
    s2, p2 = linear(20, CONFIG['n_action'])
    # return the layer shapes and all parameters as one flat vector
    return [s0, s1, s2], np.concatenate((p0, p1, p2))

The simple three-layer network is stored here as one flat vector rather than as matrices, which makes the parallel computation easier.
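The functions `params_reshape` and `get_action` are called in `get_reward` but are not listed in this post. The sketch below shows one way they could look; the tanh activations, the argmax over the outputs (discrete actions only), and the explicit `rng` and size arguments are assumptions of this sketch, not the author's code.

```python
import numpy as np

def linear(n_in, n_out, rng):
    w = rng.standard_normal(n_in * n_out).astype(np.float32) * 0.1
    b = rng.standard_normal(n_out).astype(np.float32) * 0.1
    return (n_in, n_out), np.concatenate((w, b))

def build_net(n_feature, n_action, rng):
    s0, p0 = linear(n_feature, 30, rng)
    s1, p1 = linear(30, 20, rng)
    s2, p2 = linear(20, n_action, rng)
    return [s0, s1, s2], np.concatenate((p0, p1, p2))

def params_reshape(shapes, params):
    # Cut the flat vector back into one (weights, biases) pair per layer.
    layers, start = [], 0
    for n_in, n_out in shapes:
        n_w = n_in * n_out
        w = params[start:start + n_w].reshape(n_in, n_out)
        b = params[start + n_w:start + n_w + n_out]
        layers.append((w, b))
        start += n_w + n_out
    return layers

def get_action(layers, state):
    # Forward pass; tanh hidden layers and argmax output are assumptions.
    x = state
    for w, b in layers[:-1]:
        x = np.tanh(x @ w + b)
    w, b = layers[-1]
    return int(np.argmax(x @ w + b))

rng = np.random.default_rng(0)
shapes, params = build_net(n_feature=4, n_action=2, rng=rng)   # CartPole-sized example
layers = params_reshape(shapes, params)
action = get_action(layers, np.zeros(4, dtype=np.float32))
print(params.size, action)
```

Because the perturbation and the update both act on the flat vector, reshaping is only needed at the moment the network has to choose an action. A continuous-action task such as Pendulum-v0 would need a different output step than the argmax used here.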

def train(net_shapes, net_params, pool):
    # draw one noise seed per kid; the seeds are limited to the uint32 range
    noise_seed = np.random.randint(0, 2 ** 32 - 1, size=N_KID, dtype=np.uint32)

    # run get_reward in parallel worker processes
    # (env and EP_MAX_STEP are assumed here to be module-level globals)
    jobs = [pool.apply_async(get_reward, (net_shapes, net_params, env, EP_MAX_STEP, noise_seed[k_id]))
            for k_id in range(N_KID)]
    rewards = np.array([j.get() for j in jobs])

    cumulative_update = np.zeros_like(net_params)       # initialize the accumulated update
    for k_id in range(N_KID):
        np.random.seed(noise_seed[k_id])                # reconstruct the kid's noise from its seed
        cumulative_update += rewards[k_id] * np.random.randn(net_params.size)

    net_params = net_params + LR/(N_KID*SIGMA) * cumulative_update
    return net_params

Here we use the multiprocessing module to run the evolution in parallel.
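The point of passing seeds around is that a worker only has to return one number, its reward: the noise it used can be regenerated from the seed. The sketch below shows this on a toy objective, with a plain loop in place of the process pool. The objective `reward_fn`, the constants, the reward normalization, and the use of `np.random.default_rng` are assumptions of this sketch.

```python
import numpy as np

N_KID, SIGMA, LR = 50, 0.1, 0.001          # hypothetical settings
target = np.array([0.5, 0.1, -0.3])

def reward_fn(params):
    # Toy stand-in for an episode reward (assumption): maximum 0 at the target.
    return -np.sum((params - target) ** 2)

def get_reward(params, seed):
    # The "worker": builds its own noise from the seed and returns only the reward.
    noise = np.random.default_rng(seed).standard_normal(params.size)
    return reward_fn(params + SIGMA * noise)

def train(params, rng):
    seeds = rng.integers(0, 2 ** 32 - 1, size=N_KID)
    rewards = np.array([get_reward(params, s) for s in seeds])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize
    update = np.zeros_like(params)
    for s, r in zip(seeds, rewards):
        # Regenerate exactly the noise the worker used, from the seed alone.
        update += r * np.random.default_rng(s).standard_normal(params.size)
    return params + LR / (N_KID * SIGMA) * update

rng = np.random.default_rng(0)
params = np.zeros(3)
for _ in range(300):
    params = train(params, rng)
print(params)
```

Replacing the list comprehension in `train` with calls to a process pool changes where `get_reward` runs but not the update itself.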

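def get_reward(shapes, params, env, ep_max_step, seed):
    np.random.seed(seed)    # seed the generator so the noise can be reproduced from the seed
    # perturb a copy so that the caller's parameter vector is left unchanged
    params = params + SIGMA * np.random.randn(params.size)

    # reshape the flat params into per-layer matrices
    p = params_reshape(shapes, params)
    # run one episode in gym
    s = env.reset()
    ep_r = 0.
    for step in range(ep_max_step):
        a = get_action(p, s)    # the network chooses an action
        s, r, done, _ = env.step(a)
        ep_r += r
        if done: break
    return ep_r     # return the episode reward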

Here we use the mirrored sampling method described in [2] to generate mirrored noise and speed up the ES updates.
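Mirrored sampling evaluates each noise vector together with its negation. A minimal sketch on the quadratic toy problem from section 3 is given below; the function name `es_mirrored`, the pair count, and the small constant added to the standard deviation are assumptions of this sketch.

```python
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    return -np.sum((w - solution) ** 2)

def es_mirrored(n_pairs=25, sigma=0.1, alpha=0.001, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(3)
    for _ in range(n_iter):
        half = rng.standard_normal((n_pairs, 3))
        noise = np.concatenate((half, -half))          # mirrored pairs: +eps and -eps
        rewards = np.array([f(w + sigma * n) for n in noise])
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        w = w + alpha / (len(noise) * sigma) * noise.T @ adv
    return w

w_final = es_mirrored()
print(w_final)
```

Because every pair shares one noise vector, only half as many random vectors (or seeds) have to be generated, and reward terms that are identical for +eps and -eps cancel in the update, which lowers the variance of the estimate.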

6. Complete Code in python with tensorflow.

"""
This part of code is the DQN brain, which is a brain of the agent.
All decisions are made in here.
Using Tensorflow to build the neural network.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
Using:
Tensorflow: 1.0
gym: 0.8.0
"""

import numpy as np
import pandas as pd
import tensorflow as tf


# Deep Q Network off-policy
class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy
        self.replace_target_iter = replace_target_iter
        self.memory_size = memory_size
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize zero memory [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        t_params = tf.get_collection('target_net_params')
        e_params = tf.get_collection('eval_net_params')
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()

        if output_graph:
            # $ tensorboard --logdir=logs
            # tf.train.SummaryWriter soon be deprecated, use following
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []

    def _build_net(self):
        # ------------------ build evaluate_net ------------------
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
        with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_eval = tf.matmul(l1, w2) + b2

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        # ------------------ build target_net ------------------
        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input
        with tf.variable_scope('target_net'):
            # c_names(collections_names) are the collections to store variables
            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_next = tf.matmul(l1, w2) + b2

    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition

        self.memory_counter += 1

    def choose_action(self, observation):
        # to have batch dimension when feed into tf placeholder
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    def learn(self):
        # check to replace target parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # change q_target w.r.t q_eval's action
        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        """
        For example in this batch I have 2 samples and 3 actions:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]
        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]
        Then change q_target with the real q_target value w.r.t the q_eval's action.
        For example in:
            sample 0, I took action 0, and the max q_target value is -1;
            sample 1, I took action 2, and the max q_target value is -2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]
        So the (q_target - q_eval) becomes:
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]
        We then backpropagate this error w.r.t the corresponding action to network,
        leave other action as error=0 cause we didn't choose it.
        """

        # train eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # increasing epsilon
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

    def plot_cost(self):
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()
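The worked example in the docstring of `learn` can be checked with plain NumPy. The values below are the ones from that docstring; the variable names `actions`, `targets`, and `td_error` are chosen for this sketch.

```python
import numpy as np

q_eval = np.array([[1., 2., 3.],
                   [4., 5., 6.]])
actions = np.array([0, 2])        # action taken in sample 0 and in sample 1
targets = np.array([-1., -2.])    # reward + gamma * max(q_next) for each sample

q_target = q_eval.copy()
q_target[np.arange(len(actions)), actions] = targets   # overwrite only the taken actions
td_error = q_target - q_eval                            # zero for actions that were not taken

print(q_target)
print(td_error)
```

Only the entries of the chosen actions differ from `q_eval`, so the loss produces a gradient for those actions alone.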

References

[1] Salimans T, Ho J, Chen X, et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning[J]. 2017.
[2] Brockhoff D, Auger A, Hansen N, et al. Mirrored Sampling and Sequential Selection for Evolution Strategies[C]// International Conference on Parallel Problem Solving from Nature: Part I. Springer, Berlin, Heidelberg, 2010.
[3] Storn R, Price K. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces[J]. Journal of Global Optimization, 1997.
