Evolution-Based RL Algorithms

1. What is GA.

  • The genetic algorithm (GA) is a population-based method for solving optimization problems; its ideas trace back to John Holland's work in the 1970s, and the closely related Differential Evolution algorithm was introduced by Rainer Storn and Kenneth Price in 1997 [3].
  • GA mimics biological genetic variation: a population of individuals is randomly initialized, and each generation is refined through mutation, crossover, and selection.
  • After thousands of generations (iterations), the algorithm typically converges to a good, often near-globally-optimal, solution.
    [Figure: GA searching for the optimum of a two-dimensional objective]

2. What is ES, and the difference between ES and GA.

  • The ES strategy was proposed in [1] as a replacement for the traditional policy-gradient (PG) approach in reinforcement learning.
  • The core idea of ES is similar to GA, but with a few differences.
  • The pseudocode of the ES strategy is given below:
    [Figure: pseudocode of the ES algorithm from [1]]
  • GA is an algorithm for solving optimization problems, whereas ES borrows GA's evolutionary process to update the parameters of a neural network and thereby steer the direction of reinforcement learning.
  • GA's evolution involves the full cycle of mutation, crossover, and selection, while ES simply adds noise to the parameters.
  • The pseudocode of GA is given below (the figure is reproduced from another source); a minimal Python sketch of the same loop follows this list:
    [Figure: pseudocode of GA]
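  • To make the contrast with ES concrete, here is a minimal GA sketch in Python on the same toy objective used in section 3; the population size, mutation scale, and uniform-crossover/truncation-selection choices are illustrative assumptions, not taken from [1]:
import numpy as np

solution = np.array([0.5, 0.1, -0.3])
def f(w): return -np.sum((w - solution)**2)   # fitness: higher is better

pop_size, n_dim, mut_std, n_gen = 50, 3, 0.1, 300
pop = np.random.randn(pop_size, n_dim)        # random initial population

for g in range(n_gen):
    fitness = np.array([f(ind) for ind in pop])
    parents = pop[np.argsort(fitness)[-pop_size // 2:]]   # selection: keep the better half
    children = []
    while len(children) < pop_size:
        pa, pb = parents[np.random.randint(len(parents), size=2)]
        mask = np.random.rand(n_dim) < 0.5                # uniform crossover
        child = np.where(mask, pa, pb)
        child += mut_std * np.random.randn(n_dim)         # Gaussian mutation
        children.append(child)
    pop = np.array(children)

print(pop[np.argmax([f(ind) for ind in pop])])            # should be close to `solution`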

3. The difference between ES and PG, and the benefits of ES over PG.

  • PG strategy:
    • Forward pass: run the state through the neural network, add exploration noise to the output, and evaluate the resulting action.
    • Backward pass: compute the gradient of the objective and backpropagate the error through the network.
  • ES strategy:
    • In the spirit of GA, perturb the network parameters directly and evaluate the return of each perturbed policy, without computing any gradient through the network.
  • The advantages of ES are listed in detail in [1]; the core ones are:
  • ES explores more broadly than gradient-based methods (GD or SGD) and is less likely to get stuck in a local optimum, especially in the later stages of training.
  • ES is inherently parallel, which makes it well suited to acceleration on large compute clusters, GPUs, or FPGAs.
  • Compared with PG-based RL algorithms, ES achieves roughly comparable results while saving a large amount of wall-clock time, and on some problems with local optima it even outperforms them.
  • Replacing PG with ES for reinforcement-learning training is therefore a promising direction.
  • A simple Python implementation of the ES strategy is given below:
import numpy as np
solution = np.array([0.5, 0.1, -0.3])
def f(w): return -np.sum((w - solution)**2)   # reward: negative squared distance to the target

npop = 50      # population size
sigma = 0.1    # standard deviation of the noise
alpha = 0.001  # learning rate
w = np.random.randn(3)            # initial guess for `solution`
for i in range(300):
  N = np.random.randn(npop, 3)    # sample noise
  R = np.zeros(npop)
  for j in range(npop):
    w_try = w + sigma*N[j]
    R[j] = f(w_try)               # reward from the environment
  A = (R - np.mean(R)) / np.std(R)              # normalize rewards
  w = w + alpha/(npop*sigma) * np.dot(N.T, A)   # parameter update
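  • Note that the update w = w + alpha/(npop*sigma) * np.dot(N.T, A) is a Monte-Carlo estimate of the gradient of the Gaussian-smoothed objective E[f(w + sigma*eps)], which is exactly why ES needs no backpropagation.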

4. A simple introduction to the gym test suite.

CartPole-v0
[Figure: screenshot of the CartPole-v0 environment]
For the detailed rules of this game, see the gym documentation.
MountainCar-v0
[Figure: screenshot of the MountainCar-v0 environment]
For the detailed rules of this game, see the gym documentation.
Pendulum-v0
[Figure: screenshot of the Pendulum-v0 environment]
For the detailed rules of this game, see the gym documentation.

  • The following code makes the computer play these games with random actions and no training at all:
import gym
# env = gym.make('CartPole-v0')
env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()
  • When these three games are played with a purely random policy, the probability of reaching the goal is extremely small.
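  • To see what an ES agent must handle in each task, the following short snippet (a sketch using the same gym API as above) prints the observation and action spaces of the three environments:
import gym

# Print the observation/action interface of each task.
for name in ['CartPole-v0', 'MountainCar-v0', 'Pendulum-v0']:
    env = gym.make(name)
    print(name, env.observation_space, env.action_space)
    env.close()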

5. Using an ES-based RL algorithm to solve some simple problems in the gym test suite.

  • In this section we reproduce the algorithm of [1] to solve the three problems above in a simple way.
  • The Python skeleton of the algorithm is as follows:
def get_reward():
    # evaluate the fitness (episode return) of one perturbed set of parameters
    pass

def build_net():
    # initialize the neural network
    pass

def train():
    # train the network in parallel with multiple workers
    rewards = [get_reward() for i in range(N_KID)]

# main loop
build_net()
for g in range(N_GENERATION):
    train()
  • To keep the structure of the algorithm clear, we build the neural network directly with numpy.
def build_net():
    def linear(n_in, n_out):  # network linear layer
        w = np.random.randn(n_in * n_out).astype(np.float32) * .1
        b = np.random.randn(n_out).astype(np.float32) * .1
        return (n_in, n_out), np.concatenate((w, b))
    s0, p0 = linear(CONFIG['n_feature'], 30)
    s1, p1 = linear(30, 20)
    s2, p2 = linear(20, CONFIG['n_action'])
    return [s0, s1, s2], np.concatenate((p0, p1, p2))
  • Here the simple three-layer network is stored as one flat parameter vector rather than as matrices, which makes parallel computation easier; a sketch of the helper that reshapes this vector back into matrices is given after the get_reward snippet below.
def train(net_shapes, net_params, pool):
    # seeds used to generate the noise
    noise_seed = np.random.randint(0, 2 ** 32 - 1, size=N_KID, dtype=np.uint32)  # limit the range of the seeds

    # run get_reward in parallel with multiprocessing
    # (env and CONFIG['ep_max_step'] are assumed to be defined globally, like CONFIG in build_net)
    jobs = [pool.apply_async(get_reward,
                             (net_shapes, net_params, env, CONFIG['ep_max_step'], noise_seed[k_id]))
            for k_id in range(N_KID)]
    rewards = np.array([j.get() for j in jobs])

    cumulative_update = np.zeros_like(net_params)       # initialize the update vector
    for k_id in range(N_KID):
        np.random.seed(noise_seed[k_id])                # reconstruct the noise from its seed
        cumulative_update += rewards[k_id] * np.random.randn(net_params.size)

    net_params = net_params + LR/(N_KID*SIGMA) * cumulative_update
    return net_params
  • Here we use the multiprocessing module to run the evolution in parallel.
def get_reward(shapes, params, env, ep_max_step, seed):
    np.random.seed(seed)    # reproduce the noise deterministically from the seed
    params += SIGMA * np.random.randn(params.size)

    # reshape the flat parameter vector into weight matrices
    p = params_reshape(shapes, params)
    # run one episode in gym
    s = env.reset()
    ep_r = 0.
    for step in range(ep_max_step):
        a = get_action(p, s)    # the network picks an action
        s, r, done, _ = env.step(a)
        ep_r += r
        if done: break
    return ep_r     # return the episode reward
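  • The helpers params_reshape and get_action used above are not given in the snippets; a minimal sketch consistent with the flat-vector layout produced by build_net could look as follows (the tanh hidden activation and argmax action choice are assumptions for a discrete action space):
def params_reshape(shapes, params):
    # slice the flat vector back into (weight, bias) pairs, layer by layer
    p, start = [], 0
    for n_in, n_out in shapes:
        w_end = start + n_in * n_out
        b_end = w_end + n_out
        p.append((params[start:w_end].reshape(n_in, n_out), params[w_end:b_end]))
        start = b_end
    return p

def get_action(p, s):
    # plain numpy forward pass through the layers built in build_net
    x = s[np.newaxis, :]
    for i, (w, b) in enumerate(p):
        x = x.dot(w) + b
        if i < len(p) - 1:
            x = np.tanh(x)              # hidden-layer nonlinearity (an assumption)
    return np.argmax(x, axis=1)[0]      # highest-scoring discrete action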
  • Here we also use the mirrored sampling trick from [2] to generate antithetic noise pairs and speed up the ES updates; a sketch is given below.
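  • A minimal sketch of how mirrored sampling can be layered on the seed-based noise reconstruction in train(): every seed is evaluated with both +noise and -noise; the extra sign argument passed to get_reward is an assumption for this sketch:
# Each seed is used twice: once with +noise and once with -noise.
noise_seed = np.random.randint(0, 2 ** 32 - 1, size=N_KID, dtype=np.uint32)
noise_seed = np.repeat(noise_seed, 2)
mirror = np.tile([1, -1], N_KID)

rewards = np.array([
    # get_reward would scale its sampled noise by the sign m (extra parameter, assumed)
    get_reward(net_shapes, net_params, env, CONFIG['ep_max_step'], s, m)
    for s, m in zip(noise_seed, mirror)
])

cumulative_update = np.zeros_like(net_params)
for k_id in range(2 * N_KID):
    np.random.seed(noise_seed[k_id])
    cumulative_update += rewards[k_id] * mirror[k_id] * np.random.randn(net_params.size)

net_params = net_params + LR / (2 * N_KID * SIGMA) * cumulative_update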

6. Complete code in Python with TensorFlow.

"""
This part of code is the DQN brain, which is a brain of the agent.
All decisions are made in here.
Using Tensorflow to build the neural network.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
Using:
Tensorflow: 1.0
gym: 0.8.0
"""

import numpy as np
import pandas as pd
import tensorflow as tf


# Deep Q Network off-policy
class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy
        self.replace_target_iter = replace_target_iter
        self.memory_size = memory_size
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize zero memory [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        t_params = tf.get_collection('target_net_params')
        e_params = tf.get_collection('eval_net_params')
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()

        if output_graph:
            # $ tensorboard --logdir=logs
            # tf.train.SummaryWriter soon be deprecated, use following
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []

    def _build_net(self):
        # ------------------ build evaluate_net ------------------
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
        with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_eval = tf.matmul(l1, w2) + b2

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        # ------------------ build target_net ------------------
        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input
        with tf.variable_scope('target_net'):
            # c_names(collections_names) are the collections to store variables
            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_next = tf.matmul(l1, w2) + b2

    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition

        self.memory_counter += 1

    def choose_action(self, observation):
        # to have batch dimension when feed into tf placeholder
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    def learn(self):
        # check to replace target parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # change q_target w.r.t q_eval's action
        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        """
        For example in this batch I have 2 samples and 3 actions:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]
        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]
        Then change q_target with the real q_target value w.r.t the q_eval's action.
        For example in:
            sample 0, I took action 0, and the max q_target value is -1;
            sample 1, I took action 2, and the max q_target value is -2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]
        So the (q_target - q_eval) becomes:
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]
        We then backpropagate this error w.r.t the corresponding action to network,
        leave other action as error=0 cause we didn't choose it.
        """

        # train eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # increasing epsilon
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

    def plot_cost(self):
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()
References

[1] Salimans T., Ho J., Chen X., et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. 2017.
[2] Brockhoff D., Auger A., Hansen N., et al. Mirrored Sampling and Sequential Selection for Evolution Strategies. In: Parallel Problem Solving from Nature (PPSN), Part I. Springer, Berlin, Heidelberg, 2010.
[3] Storn R., Price K. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. Journal of Global Optimization, 1997.
