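Evolution-based RL algorithms
1. What is GA?
- The GA referred to here is the evolutionary optimization algorithm proposed by Rainer Storn and Kenneth Price in 1997 (published as Differential Evolution [3]) for solving optimization problems.
- GA imitates biological genetic variation: a population of individuals is initialized at random, and in every generation the population goes through mutation, crossover, and selection in search of the global optimum.
- After thousands of generations (iterations), the algorithm typically converges to the global optimum or close to it, although this is not guaranteed.
Figure: the search process of GA for the optimum in a two-dimensional space.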
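The mutation, crossover, and selection steps described above can be sketched in a few lines of NumPy. This is a minimal, generic real-coded GA for illustration only, not the Differential Evolution variant of [3]; the toy objective, the population size, the truncation selection, and the elitism step are all assumptions chosen to keep the example short.

```python
import numpy as np

def fitness(x):
    # Toy objective (assumed for illustration): the maximum is at (0.5, 0.1).
    target = np.array([0.5, 0.1])
    return -np.sum((x - target) ** 2, axis=-1)

def run_ga(pop_size=40, n_generations=100, mutation_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-2.0, 2.0, size=(pop_size, 2))  # random initial population
    for _ in range(n_generations):
        fit = fitness(pop)
        # Selection: keep the better half of the population as parents
        # (sorted by ascending fitness, so parents[-1] is the best one).
        parents = pop[np.argsort(fit)[-pop_size // 2:]]
        # Crossover: each child takes every coordinate from one of two random parents.
        idx_a = rng.integers(len(parents), size=pop_size)
        idx_b = rng.integers(len(parents), size=pop_size)
        mask = rng.random((pop_size, 2)) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])
        # Mutation: add small Gaussian noise to every child.
        children += mutation_std * rng.normal(size=children.shape)
        # Elitism: carry the best parent over unchanged so the best fitness never drops.
        children[0] = parents[-1]
        pop = children
    return pop[np.argmax(fitness(pop))]

best = run_ga()
print(best)  # should land close to the optimum (0.5, 0.1)
```

With these settings the best individual ends up near the optimum, but on harder, multimodal objectives the result depends strongly on the population size and the mutation scale.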
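2. What is ES, and the difference between ES and GA.
- ES is the method proposed by the authors of [1] as a replacement for policy gradient (PG) methods in conventional reinforcement learning.
- The core idea of ES is similar to that of GA, with some differences.
- The pseudocode of ES is given below:
- GA is an algorithm for solving optimization problems, whereas ES imitates the evolutionary process of GA to update the parameters of a neural network and thereby steer reinforcement learning.
- The evolutionary procedure of GA includes full mutation, crossover, and selection steps, whereas ES simply adds noise to the parameters.
- The pseudocode of GA is given below (this figure is reproduced from another source; if it infringes any rights, please contact the author):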
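For reference, the noise-only update that distinguishes ES from GA can be written as one iteration of the algorithm in [1]. The symbols are notation chosen here: $\theta_t$ for the network parameters, $n$ for the population size, $\sigma$ for the noise standard deviation, $\alpha$ for the learning rate, and $F$ for the episode return. They correspond to `w`, `npop`, `sigma`, `alpha`, and `f` in the Python implementation in Section 3.

```latex
\begin{aligned}
\epsilon_i &\sim \mathcal{N}(0, I), \qquad i = 1, \dots, n \\
F_i &= F(\theta_t + \sigma \epsilon_i) \\
\theta_{t+1} &= \theta_t + \alpha \, \frac{1}{n \sigma} \sum_{i=1}^{n} F_i \, \epsilon_i
\end{aligned}
```

There is no crossover and no explicit selection: every perturbation contributes to the update, weighted by the return it obtained.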
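3. The difference between ES and PG, and the benefits of ES over PG.
- PG:
  - Forward pass: run the neural network forward, then add noise to its output to obtain the learning value of each action.
  - Backward pass: compute the gradient of the learning value and backpropagate the error.
- ES:
  - Perturb the parameters of the neural network directly, in the spirit of GA, and evaluate the learning value of each action without computing any gradient.
- The advantages of ES are listed in detail in [1]; the main ones are:
  - Compared with gradient-based methods (GD or SGD), ES explores more strongly and is less likely to get stuck in a local optimum, especially late in training.
  - ES is inherently parallel, which makes it well suited to acceleration on large clusters of machines, or on GPUs and FPGAs.
  - Compared with PG-based RL algorithms, ES achieves similar results while saving a large amount of running time, and even outperforms them on some problems with local optima.
- ES is therefore a promising replacement for PG in reinforcement learning training.
- A simple implementation of ES in Python is given below: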
import numpy as np

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # reward: negative squared distance to the solution (the maximum is 0)
    return -np.sum((w - solution)**2)

npop = 50      # population size
sigma = 0.1    # noise standard deviation
alpha = 0.001  # learning rate
w = np.random.randn(3)  # initial guess for the solution
for i in range(300):
    N = np.random.randn(npop, 3)  # sample the noise
    R = np.zeros(npop)
    for j in range(npop):
        w_try = w + sigma*N[j]
        R[j] = f(w_try)  # reward returned by the environment
    A = (R - np.mean(R)) / np.std(R)  # normalize the rewards
    w = w + alpha/(npop*sigma) * np.dot(N.T, A)  # update the parameters
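4. A brief introduction to the gym test suite.
CartPole-v0
For the detailed rules of this game, see the environment's page.
MountainCar-v0
For the detailed rules of this game, see the environment's page.
Pendulum-v0
For the detailed rules of this game, see the environment's page.
- The following code lets the computer play these games with random actions, without any training: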
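import gym

# env = gym.make('CartPole-v0')
env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
env.reset()
for _ in range(1000):
    env.render()
    _, _, done, _ = env.step(env.action_space.sample())  # take a random action
    if done:  # start a new episode once the current one has ended
        env.reset()
env.close()
- When the computer plays these three games with a random policy, it has only a very small chance of reaching the goal of the game.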
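To put a number on how poorly the random policy does, one can record the total reward of several random episodes. The helper below is a minimal sketch: the name `random_policy_returns` and its default arguments are made up for this example, and it assumes the classic gym interface used in this post, where `step` returns `(state, reward, done, info)`.

```python
import numpy as np

def random_policy_returns(env, n_episodes=20, max_steps=200):
    """Play n_episodes with uniformly random actions; return each episode's total reward.

    Assumes the classic gym interface: reset() -> state,
    step(action) -> (state, reward, done, info).
    """
    returns = []
    for _ in range(n_episodes):
        env.reset()
        total = 0.0
        for _step in range(max_steps):
            _, reward, done, _ = env.step(env.action_space.sample())
            total += reward
            if done:
                break
        returns.append(total)
    return np.array(returns)

# Usage with gym installed (not executed here):
#   import gym
#   env = gym.make('MountainCar-v0')
#   print(random_policy_returns(env).mean())
```

The mean of the returned array gives a baseline against which the trained ES policy in the next section can be compared.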
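5. Using an ES-based RL algorithm to solve some simple problems from the gym test suite.
- In this section we reproduce the algorithm of [1] to solve the three problems above in a simple way.
- The Python skeleton of the algorithm is as follows: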
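def get_reward():
    # compute the fitness (episode reward) of one perturbed network
    pass

def build_net():
    # initialize the neural network
    pass

def train():
    # train the network in parallel
    rewards = [get_reward() for i in range(N_KID)]

# main loop
build_net()
for g in range(N_GENERATION):
    train()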
- To keep the structure of the algorithm clear, we build the neural network directly with numpy.
def build_net():
    def linear(n_in, n_out):  # network linear layer
        w = np.random.randn(n_in * n_out).astype(np.float32) * .1
        b = np.random.randn(n_out).astype(np.float32) * .1
        return (n_in, n_out), np.concatenate((w, b))
    s0, p0 = linear(CONFIG['n_feature'], 30)
    s1, p1 = linear(30, 20)
    s2, p2 = linear(20, CONFIG['n_action'])
    return [s0, s1, s2], np.concatenate((p0, p1, p2))
- The simple 3-layer network is stored here as one flat vector rather than as matrices, which makes parallel computation easier.
def train(net_shapes, net_params, pool):
    # one noise seed per kid; the range is limited so each value is a valid seed
    noise_seed = np.random.randint(0, 2 ** 32 - 1, size=N_KID, dtype=np.uint32)
    # run get_reward in parallel with multiple processes;
    # env and CONFIG['ep_max_step'] are assumed to be defined globally
    jobs = [pool.apply_async(get_reward, (net_shapes, net_params, env,
                                          CONFIG['ep_max_step'], noise_seed[k_id]))
            for k_id in range(N_KID)]
    rewards = np.array([j.get() for j in jobs])
    cumulative_update = np.zeros_like(net_params)  # initialize updated values
    for k_id in range(N_KID):
        np.random.seed(noise_seed[k_id])  # reconstruct noise using seed
        cumulative_update += rewards[k_id] * np.random.randn(net_params.size)
    net_params = net_params + LR/(N_KID*SIGMA) * cumulative_update
    return net_params
- Here we use the multiprocessing module to run the evolution process in parallel.
def get_reward(shapes, params, env, ep_max_step, seed,):
    np.random.seed(seed)  # the seed makes the pseudo-random noise reproducible
    # perturb a copy so that the caller's params are left unchanged
    params = params + SIGMA * np.random.randn(params.size)
    # reshape the flat params into matrices
    p = params_reshape(shapes, params)
    # run one episode in gym
    s = env.reset()
    ep_r = 0.
    for step in range(ep_max_step):
        a = get_action(p, s)  # the neural network chooses the action
        s, r, done, _ = env.step(a)
        ep_r += r
        if done: break
    return ep_r  # return the episode reward
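The two helpers called in get_reward, params_reshape and get_action, are not shown in this section. The following is a minimal sketch of what they could look like, assuming the flat layout produced by build_net (the weights of each layer followed by its bias), tanh hidden units, and a discrete action space. The tanh activation and the greedy argmax are assumptions; a continuous-action task such as Pendulum-v0 would need a different output step.

```python
import numpy as np

def params_reshape(shapes, params):
    """Split the flat parameter vector into [(W, b), ...], one pair per layer."""
    layers, start = [], 0
    for n_in, n_out in shapes:
        n_w = n_in * n_out
        w = params[start:start + n_w].reshape(n_in, n_out)
        b = params[start + n_w:start + n_w + n_out]
        layers.append((w, b))
        start += n_w + n_out
    return layers

def get_action(layers, state):
    """Forward pass; assumes tanh hidden units and a discrete action space."""
    x = np.asarray(state, dtype=np.float32)
    for w, b in layers[:-1]:
        x = np.tanh(x @ w + b)
    w, b = layers[-1]
    return int(np.argmax(x @ w + b))  # index of the highest-scoring action

# Small check with a 4 -> 30 -> 20 -> 2 network (CartPole-like sizes, assumed).
shapes = [(4, 30), (30, 20), (20, 2)]
n_params = sum(n_in * n_out + n_out for n_in, n_out in shapes)
params = np.random.randn(n_params).astype(np.float32) * 0.1
layers = params_reshape(shapes, params)
action = get_action(layers, np.zeros(4))
print(action)
```

Because the slices are views into the flat vector, reshaping costs almost nothing, which is part of why the flat representation is convenient for the noise-based update.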
- Here we use the mirrored sampling method described in [2] to generate mirrored noise and speed up the iterative ES updates.
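The idea of mirrored sampling is to evaluate every noise vector together with its negation, so that the perturbations come in pairs (+eps, -eps). The sketch below shows it on the toy objective from Section 3 rather than on a gym task. The function names, the step size, and the number of pairs are assumptions for illustration, and the rewards are used without normalization.

```python
import numpy as np

def mirrored_noise(n_pairs, n_params, rng):
    """Return 2 * n_pairs noise vectors laid out as [+eps_1..+eps_n, -eps_1..-eps_n]."""
    eps = rng.normal(size=(n_pairs, n_params))
    return np.concatenate([eps, -eps])

def es_step(f, w, rng, n_pairs=25, sigma=0.1, alpha=0.05):
    """One ES update that uses mirrored noise."""
    noise = mirrored_noise(n_pairs, w.size, rng)
    rewards = np.array([f(w + sigma * n) for n in noise])
    return w + alpha / (len(noise) * sigma) * noise.T @ rewards

solution = np.array([0.5, 0.1, -0.3])

def f(w):
    # Toy reward: negative squared distance to the solution.
    return -np.sum((w - solution) ** 2)

rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(300):
    w = es_step(f, w, rng)
print(w)  # should be close to solution
```

In the seed-based setup used by train and get_reward, the same effect could be obtained by letting two workers share one seed and flipping the sign of the noise for one of them.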
6. Complete code in Python with TensorFlow.
"""
This part of code is the DQN brain, which is a brain of the agent.
All decisions are made in here.
Using Tensorflow to build the neural network.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
Using:
Tensorflow: 1.0
gym: 0.8.0
"""
import numpy as np
import pandas as pd
import tensorflow as tf
# Deep Q Network off-policy
class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy
        self.replace_target_iter = replace_target_iter
        self.memory_size = memory_size
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize zero memory [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        t_params = tf.get_collection('target_net_params')
        e_params = tf.get_collection('eval_net_params')
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()

        if output_graph:
            # $ tensorboard --logdir=logs
            # tf.train.SummaryWriter soon be deprecated, use following
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []

    def _build_net(self):
        # ------------------ build evaluate_net ------------------
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
        with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_eval = tf.matmul(l1, w2) + b2

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        # ------------------ build target_net ------------------
        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input
        with tf.variable_scope('target_net'):
            # c_names(collections_names) are the collections to store variables
            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_next = tf.matmul(l1, w2) + b2

    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition

        self.memory_counter += 1

    def choose_action(self, observation):
        # to have batch dimension when feed into tf placeholder
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    def learn(self):
        # check to replace target parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # change q_target w.r.t q_eval's action
        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        """
        For example in this batch I have 2 samples and 3 actions:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        Then change q_target with the real q_target value w.r.t the q_eval's action.
        For example in:
            sample 0, I took action 0, and the max q_target value is -1;
            sample 1, I took action 2, and the max q_target value is -2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]

        So the (q_target - q_eval) becomes:
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]

        We then backpropagate this error w.r.t the corresponding action to network,
        leave other action as error=0 cause we didn't choose it.
        """

        # train eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # increasing epsilon
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

    def plot_cost(self):
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()
- References
[1] Salimans T, Ho J, Chen X, et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning[J]. 2017.
[2] Brockhoff D, Auger A, Hansen N, et al. Mirrored Sampling and Sequential Selection for Evolution Strategies[C]// International Conference on Parallel Problem Solving from Nature: Part I. Springer, Berlin, Heidelberg, 2010.
[3] Storn R, Price K. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces[J]. Journal of Global Optimization, 1997.