Demo: Reinforcement Learning in the OpenAI Gym Environment
OpenAI Gym environment: this experiment uses the CartPole environment, in which a cart moves along a one-dimensional frictionless track with a pole loosely attached on top. The pole sways left and right, and the agent applies a positive or negative force to the cart to keep the pole upright, as shown in the figure below.
This experiment uses a policy network: by observing the environment state, it directly predicts the action probabilities, and acting on them maximizes the expected return (the sum of the current and discounted future rewards).
(1) Policy network design
The network is a four-layer MLP with two hidden layers.
Input layer: the current environment state, described by four values: cart position, cart velocity, pole angle, and pole angular velocity. The input is therefore a 4-dimensional vector, so the input layer has 4 neurons.
Hidden layers: 2 hidden layers with 40 neurons each.
Output layer: the probability of the action. There are two possible actions, applying a positive or a negative force to the cart, so the output layer needs only one neuron; a sigmoid activation produces the probability Pa = P(action = 1). A random number Pr is drawn uniformly from (0, 1); if Pr < Pa the action is 1, otherwise 0.
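As a sketch of the forward pass described above, in plain NumPy rather than the TensorFlow graph used in the source code below; the weight values here are random placeholders for illustration, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights (4 -> 40 -> 40 -> 1), randomly initialized for illustration.
W1 = rng.standard_normal((4, 40)) * 0.1
W2 = rng.standard_normal((40, 40)) * 0.1
W3 = rng.standard_normal((40, 1)) * 0.1

def policy_forward(state):
    """Map a 4-dim CartPole state to Pa = P(action = 1)."""
    h1 = np.maximum(0, state @ W1)        # hidden layer 1, ReLU
    h2 = np.maximum(0, h1 @ W2)           # hidden layer 2, ReLU
    score = h2 @ W3                       # single output neuron
    return 1.0 / (1.0 + np.exp(-score))   # sigmoid -> probability

# cart position, cart velocity, pole angle, pole angular velocity
state = np.array([[0.0, 0.1, -0.02, 0.3]])
pa = policy_forward(state)
action = 1 if rng.uniform() < pa.item() else 0  # sample the action from Pa
```

Sampling the action from Pa (rather than always taking the more probable one) is what lets the policy explore during training.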
(2) Hyperparameter design
Activation functions: ReLU in the hidden layers, sigmoid in the output layer.
Learning rate: fixed at 0.1.
Optimizer: Adam, combined with the policy gradient method; the model learns from the reward each action obtains in the environment and updates its parameters via gradients.
Iterations: at most 10,000 episodes; training stops once the average reward reaches 200, at which point training has converged and the pole is held stable.
(3) Reward design
The reward uses the standard discounted future reward: each future reward is multiplied by successive powers of a discount factor, which is typically a number slightly below 1; in this experiment the discount factor is 0.99. The discounted reward is computed as R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{k≥0} γ^k·r_{t+k}, or equivalently by the recurrence R_t = r_t + γ·R_{t+1}.
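The recurrence can be checked on a toy episode; this is a NumPy restatement of the `discount_rewards` function in the source code below, with hand-computed values for a three-step episode of reward 1 each:

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Discounted future reward via the recurrence R_t = r_t + gamma * R_{t+1}."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(r))):   # walk the episode backwards
        running = running * gamma + r[t]
        discounted[t] = running
    return discounted

out = discount_rewards(np.array([1.0, 1.0, 1.0]))
# R_2 = 1, R_1 = 1 + 0.99*1 = 1.99, R_0 = 1 + 0.99*1.99 = 2.9701
```

Earlier steps accumulate more discounted future reward, which is why reward normalization (done in the training loop below) matters before using these values as advantages.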
(4) Training process
For each episode:
- Initialization: reset the environment; initialize the parameters; initialize a GradBuffer that accumulates gradients until a full batch_size of episodes has finished, at which point the accumulated gradients are applied to the model parameters;
- Run the network forward to obtain the probability Pa that the action is 1; draw a random number from (0, 1); if it is less than Pa, action = 1, otherwise action = 0;
- Execute the action in the current environment state to obtain the new state, the reward for the action, and the done flag; when done is True the episode ends, the discounted future rewards are computed, and the gradients are computed and accumulated.
- Training stops when the average reward exceeds the preset threshold or the maximum number of episodes is reached.
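To make the gradient step concrete, here is a NumPy sketch of the surrogate loss that the TensorFlow graph in the source code below differentiates; the batch values (`y`, `pa`, `adv`) are made-up numbers for illustration. Note that the code stores y = 1 - action, so the expression inside the log reduces to the probability of the action actually taken:

```python
import numpy as np

# Hypothetical batch: fake labels y = 1 - action, predicted probabilities
# Pa = P(action = 1), and normalized discounted rewards as advantages.
y   = np.array([[1.0], [0.0], [1.0]])
pa  = np.array([[0.4], [0.7], [0.6]])
adv = np.array([[1.2], [-0.5], [0.3]])

# y*(y - Pa) + (1-y)*(y + Pa) equals (1 - Pa) when y = 1 (action 0)
# and Pa when y = 0 (action 1), i.e. the probability of the taken action.
loglik = np.log(y * (y - pa) + (1 - y) * (y + pa))

# Surrogate loss: its gradient is the policy gradient. Actions followed by
# high advantage are made more likely, low-advantage actions less likely.
loss = -np.mean(loglik * adv)
```

Minimizing this loss with Adam is exactly what `updateGrads` does in the full program, except that there the gradients of several episodes are summed in a buffer before being applied.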
In the best run, the average reward within a batch reached 200 after about 160 episodes, at which point the pole was held stable. The figure below plots reward against episode number.
The source code follows:
# -*- coding: utf-8 -*-
"""
Created on Fri Sep 7 10:14:37 2018
"""
import numpy as np
import tensorflow as tf
import gym
from matplotlib import pyplot
#CartPole experiment
env = gym.make('CartPole-v0')
env.reset()
# reward function: discounted future reward, R_t = r_t + gamma * R_{t+1}
def discount_rewards(r):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add*gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
#############################MLP network parameters and architecture#####
# hyperparameters
hidden_nodes = 40 # number of hidden layer neurons
batch_size = 20
learning_rate = 1e-1 # learning rate
gamma = 0.99 # discount factor for reward
input_dim = 4 # input dimensionality: cart location, cart velocity, pole angle, pole angular velocity
tf.reset_default_graph()
# 4-layer NN: input, 2 hidden layers, output
observations = tf.placeholder(tf.float32, [None,input_dim], name="input_x")
w1 = tf.get_variable("W1",shape=[input_dim,hidden_nodes],initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,w1))
w2 = tf.get_variable("W2",shape=[hidden_nodes,hidden_nodes],initializer=tf.contrib.layers.xavier_initializer())
layer2 = tf.nn.relu(tf.matmul(layer1,w2))
w3 = tf.get_variable("W3",shape=[hidden_nodes,1],initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer2,w3)
probability = tf.nn.sigmoid(score)
#############################input parameters########################
input_y = tf.placeholder(tf.float32, [None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")
# log-likelihood of the sampled action: with input_y = 1 - action, this
# reduces to log(1 - probability) for action 0 and log(probability) for action 1
loglik = tf.log(input_y*(input_y-probability)+(1-input_y)*(input_y+probability))
loss = -tf.reduce_mean(loglik*advantages)
tvars = tf.trainable_variables()
newGrads = tf.gradients(loss,tvars)
#adam,batch
adam = tf.train.AdamOptimizer(learning_rate = learning_rate)
w1Grad = tf.placeholder(tf.float32,name="batch_grad1")
w2Grad = tf.placeholder(tf.float32,name="batch_grad2")
w3Grad = tf.placeholder(tf.float32,name="batch_grad3")
batchGrad = [w1Grad,w2Grad,w3Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
xs,ys,drs = [],[],[]
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()
#############################Training process########################
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    # pyplot.figure(1)
    # observation initialization
    observation = env.reset()
    print(observation)
    # gradient buffer initialization: same shapes as the trainable variables
    gradBuffer = sess.run(tvars)
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    while episode_number <= total_episodes:
        # render once the policy performs reasonably well
        if reward_sum/batch_size > 100 or rendering == True:
            env.render()
            rendering = True
        # forward pass: probability Pa that action = 1, then sample the action
        x = np.reshape(observation, [1, input_dim])
        tfprob = sess.run(probability, feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0
        xs.append(x)
        y = 1 - action  # "fake label" consumed by the loglik expression above
        ys.append(y)
        observation, reward, done, info = env.step(action)
        reward_sum += reward
        drs.append(reward)
        if done:
            episode_number += 1
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            xs, ys, drs = [], [], []
            # discount and normalize this episode's rewards
            discounted_epr = discount_rewards(epr)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)
            # accumulate the episode's gradients into the buffer
            tGrad = sess.run(newGrads, feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
            for ix, grad in enumerate(tGrad):
                gradBuffer[ix] += grad
            # apply the accumulated gradients once per batch of episodes
            if episode_number % batch_size == 0:
                sess.run(updateGrads, feed_dict={w1Grad: gradBuffer[0], w2Grad: gradBuffer[1], w3Grad: gradBuffer[2]})
                for ix, grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0
                print("Average reward for episode %d: %f." % (episode_number, reward_sum/batch_size))
                # pyplot.scatter(episode_number, reward_sum/batch_size, c='r', marker='.')
                # pyplot.xlabel('episode_number')
                # pyplot.ylabel('reward')
                # pyplot.pause(0.00001)
                if reward_sum/batch_size >= 200:
                    print("OK:", episode_number)
                    break
                reward_sum = 0
            observation = env.reset()
# pyplot.show()