DDPG
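The DQN applications discussed earlier, such as playing Atari games, all operate in discrete environments with a finite set of actions. Now consider a continuous action space, such as training a robot to walk. Q-learning cannot be applied directly there, because finding the greedy policy would require an expensive optimization over actions at every time step. Even if the continuous space is discretized, important features may be lost and the resulting action set becomes enormous, which makes convergence hard to guarantee.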
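To address this, we use an architecture called actor-critic, which consists of two networks: an actor network and a critic network. The actor-critic architecture combines policy gradients with the state-action value function. The actor network determines the best action in a given state by tuning its parameters θ, and the critic network evaluates the actions the actor produces by computing the temporal-difference (TD) error. In other words, the policy gradient is applied to the actor network to select actions, and the critic network then uses the TD error to evaluate them. The actor-critic architecture is shown in the figure.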
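As in DQN, an experience replay buffer is used: the actor and critic networks are trained on minibatches of experience sampled from it. In addition, separate target actor and target critic networks are used to compute the loss.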
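For example, in a game of Pong the features, such as position and velocity, come on different scales, so all features are rescaled to a common scale. This is done with a method called batch normalization, which normalizes each feature to zero mean and unit variance. How are new actions explored? In a continuous environment there are far too many actions to enumerate, so exploration is done by adding noise N to the action produced by the actor network. The noise is generated by a random process called the Ornstein-Uhlenbeck process.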
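The feature scaling described above can be sketched in a few lines. This is only the normalization step (zero mean, unit variance per feature); real batch normalization layers also learn a scale and shift, which are omitted here. The function name `standardize` and the sample batch are assumptions made for illustration.

```python
import statistics

def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Population standard deviation; a tiny epsilon avoids division by zero.
    stds = [statistics.pstdev(c) + 1e-8 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical batch: column 0 is a position in pixels, column 1 a velocity.
batch = [[120.0, 0.5], [80.0, -0.2], [100.0, 0.3]]
scaled = standardize(batch)
print(scaled)
```

After scaling, the position and velocity columns are on the same scale even though their raw magnitudes differ by two orders of magnitude.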
Next, we examine the DDPG algorithm in detail.
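There are two networks: the actor network and the critic network. The actor network is denoted $\mu(s; \theta^{\mu})$; it takes a state as input and outputs an action, where $\theta^{\mu}$ are the actor network's weights. The critic network is denoted $Q(s, a; \theta^{Q})$; it takes a state and an action as input and returns a Q value, where $\theta^{Q}$ are the critic network's weights.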
Likewise, a target network is defined for each of the actor and the critic: $\mu(s; \theta^{\mu'})$ and $Q(s, a; \theta^{Q'})$, where $\theta^{\mu'}$ and $\theta^{Q'}$ are the weights of the target actor and target critic networks, respectively.
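The actor network's weights are updated with the policy gradient, and the critic network's weights are updated with the gradient computed from the TD error.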
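To make the actor update concrete, here is a toy sketch of a deterministic policy gradient step. It assumes a one-parameter actor a = theta * s and a hand-written critic Q(s, a) = -(a - 2s)^2, so the best policy is theta = 2. Both functions are illustrative stand-ins, not the networks used in this article.

```python
# Toy deterministic policy gradient.
# Assumed actor: a = theta * s. Assumed critic: Q(s, a) = -(a - 2s)^2.

def actor(theta, s):
    return theta * s

def dq_da(s, a):
    # Gradient of the toy critic with respect to the action.
    return -2.0 * (a - 2.0 * s)

def actor_gradient(theta, states):
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta, averaged over the batch.
    # For this actor, da/dtheta is simply s.
    grads = [dq_da(s, actor(theta, s)) * s for s in states]
    return sum(grads) / len(grads)

theta = 0.0
states = [0.5, 1.0, -1.5]
for _ in range(100):
    # Gradient ascent on Q: move theta in the direction that raises the critic's score.
    theta += 0.1 * actor_gradient(theta, states)

print(theta)
```

The actor never sees a reward directly; it only follows the critic's gradient with respect to the action, which is the same idea as minimizing `-mean(Q)` with respect to the actor's weights.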
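First, an action is selected by adding exploration noise $N$ to the action produced by the actor network, i.e. $\mu(s; \theta^{\mu}) + N$. This action is executed in state $s$, yielding a reward $r$ and a transition to a new state $s'$. The transition is stored in the experience replay buffer.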
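A minimal sketch of Ornstein-Uhlenbeck exploration noise follows, assuming a simple Euler discretization of the process. The class name `OUNoise`, the helper `noisy_action`, and the parameter values (theta=0.15, sigma=0.2, dt=0.01, action range [-2, 2]) are assumptions chosen for illustration and would need tuning in practice.

```python
import random

class OUNoise:
    """Discretized OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        # Restart the process at its long-run mean, e.g. at the start of an episode.
        self.x = self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

def noisy_action(policy_action, noise, low=-2.0, high=2.0):
    # Add exploration noise, then clip to the valid action range.
    return max(low, min(high, policy_action + noise.sample()))

noise = OUNoise(seed=0)
actions = [noisy_action(0.5, noise) for _ in range(5)]
print(actions)
```

Unlike independent Gaussian noise, consecutive OU samples are correlated, so the perturbation drifts smoothly instead of jittering from step to step.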
After a number of iterations, transitions are sampled from the replay buffer to train the networks, and the target Q value is computed as
$$y_{i}=r_{i}+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} ; \theta^{\mu^{\prime}}\right) ; \theta^{Q^{\prime}}\right)$$
The critic loss is the mean squared TD error:
$$L=\frac{1}{M} \sum_{i}\left(y_{i}-Q\left(s_{i}, a_{i} ; \theta^{Q}\right)\right)^{2}$$
where $M$ is the number of samples drawn from the replay buffer for training.
The critic network's weights are updated with the gradient of the loss $L$.
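As a worked example of the target and loss computation, the sketch below evaluates them for a minibatch of three transitions. All numbers are made up for illustration; in the real algorithm the Q values come from the target critic and the main critic.

```python
gamma = 0.9

# Hypothetical minibatch of M = 3 transitions.
rewards = [1.0, 0.0, -1.0]
# Q'(s_{i+1}, mu'(s_{i+1})) as returned by the target critic (made-up numbers).
target_next_q = [2.0, 1.0, 0.5]
# Q(s_i, a_i) as predicted by the main critic (made-up numbers).
predicted_q = [2.5, 1.0, -0.5]

# Target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).
targets = [r + gamma * q for r, q in zip(rewards, target_next_q)]
# TD error y_i - Q(s_i, a_i) for each sample.
td_errors = [y - q for y, q in zip(targets, predicted_q)]
# Critic loss: mean of the squared TD errors.
loss = sum(e ** 2 for e in td_errors) / len(td_errors)

print(targets, td_errors, loss)
```

Note that the square applies to the whole difference, not to the predicted Q value alone.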
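Similarly, the actor network's weights are updated with the policy gradient, and then the weights of the target actor and target critic networks are updated. Updating the target weights slowly improves stability; this is called soft replacement: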
$$\theta^{\prime} \leftarrow \tau \theta+(1-\tau) \theta^{\prime}$$
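The effect of soft replacement is easiest to see numerically. The sketch below applies the update rule 100 times to a pair of plain Python lists standing in for network weights; the helper name `soft_update` and the weight values are assumptions for illustration.

```python
def soft_update(target, source, tau):
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(source, target)]

main_weights = [1.0, -2.0]
target_weights = [0.0, 0.0]
tau = 0.01

for _ in range(100):
    target_weights = soft_update(target_weights, main_weights, tau)

print(target_weights)
```

With the main weights held fixed, the remaining gap shrinks by a factor of (1 - tau) per step, so after 100 steps with tau = 0.01 the target has covered roughly 63% of the distance to the main weights.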
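Inverted pendulum
The agent swings the pendulum up and keeps it upright. The code below uses the TensorFlow 1.x API (tf.Session, tf.placeholder) and the classic gym reset/step interface.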
import tensorflow as tf
import numpy as np
import gym
Define the hyperparameters as follows:
# number of steps in each episode
epsiode_steps = 500
# learning rate for actor
lr_a = 0.001
# learning rate for critic
lr_c = 0.002
# discount factor
gamma = 0.9
# soft replacement rate (tau in the update rule above)
alpha = 0.01
# replay buffer size
memory = 10000
# batch size for training
batch_size = 32
render = False
Implementing DDPG
class DDPG(object):
    def __init__(self, no_of_actions, no_of_states, a_bound):
        # replay memory: one row per transition, laid out as [s, a, r, s_]
        self.memory = np.zeros((memory, no_of_states * 2 + no_of_actions + 1), dtype=np.float32)
        # write pointer into the replay memory
        self.pointer = 0
        # initialize tensorflow session
        self.sess = tf.Session()
        # scale of the exploration noise; passed to np.random.normal as the
        # standard deviation and decayed during training
        self.noise_variance = 3.0
        self.no_of_actions, self.no_of_states, self.a_bound = no_of_actions, no_of_states, a_bound
        # placeholders for current state, next state and reward
        self.state = tf.placeholder(tf.float32, [None, no_of_states], 's')
        self.next_state = tf.placeholder(tf.float32, [None, no_of_states], 's_')
        self.reward = tf.placeholder(tf.float32, [None, 1], 'r')
        # build the actor network, with separate eval (main) and target networks
        with tf.variable_scope('Actor'):
            self.a = self.build_actor_network(self.state, scope='eval', trainable=True)
            a_ = self.build_actor_network(self.next_state, scope='target', trainable=False)
        # build the critic network, with separate eval (main) and target networks
        with tf.variable_scope('Critic'):
            q = self.build_crtic_network(self.state, self.a, scope='eval', trainable=True)
            q_ = self.build_crtic_network(self.next_state, a_, scope='target', trainable=False)
        # collect the parameters of each network
        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')
        # soft update of the target networks. Each target variable is paired with its
        # main-network counterpart; zipping the actor and critic lists together would
        # silently drop the critic's last variable because the critic has more variables
        self.soft_replace = [tf.assign(t, (1-alpha)*t + alpha*e)
                             for t, e in zip(self.at_params + self.ct_params,
                                             self.ae_params + self.ce_params)]
        # target Q value: y = reward + gamma * Q'(s', a')
        q_target = self.reward + gamma * q_
        # critic loss: mean squared TD error between target and predicted Q values
        td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q)
        # train the critic network with the Adam optimizer
        self.ctrain = tf.train.AdamOptimizer(lr_c).minimize(td_error, name="adam-ink", var_list = self.ce_params)
        # actor loss: maximizing Q is the same as minimizing -Q
        a_loss = - tf.reduce_mean(q)
        # train the actor network with the Adam optimizer
        self.atrain = tf.train.AdamOptimizer(lr_a).minimize(a_loss, var_list=self.ae_params)
        # write the graph so the network can be visualized in TensorBoard
        tf.summary.FileWriter("logs", self.sess.graph)
        # initialize all variables
        self.sess.run(tf.global_variables_initializer())
    # Select an action by adding exploration noise to the actor's output.
    # The text describes an Ornstein-Uhlenbeck process; this implementation uses
    # simpler Gaussian noise whose standard deviation decays during training.
    def choose_action(self, s):
        a = self.sess.run(self.a, {self.state: s[np.newaxis, :]})[0]
        # clip to the valid action range (assumes symmetric bounds, [-2, 2] for Pendulum)
        a = np.clip(np.random.normal(a, self.noise_variance), -self.a_bound, self.a_bound)
        return a
    # learn is where the actual training happens: sample a minibatch of states, actions,
    # rewards and next states from the experience buffer, then train the actor and critic
    def learn(self):
        # soft target replacement
        self.sess.run(self.soft_replace)
        indices = np.random.choice(memory, size=batch_size)
        batch_transition = self.memory[indices, :]
        # each row is laid out as [s, a, r, s_]
        batch_states = batch_transition[:, :self.no_of_states]
        batch_actions = batch_transition[:, self.no_of_states: self.no_of_states + self.no_of_actions]
        batch_rewards = batch_transition[:, -self.no_of_states - 1: -self.no_of_states]
        batch_next_state = batch_transition[:, -self.no_of_states:]
        # actor step: the action is produced by the actor itself
        self.sess.run(self.atrain, {self.state: batch_states})
        # critic step: feed the stored actions in place of the actor's output
        self.sess.run(self.ctrain, {self.state: batch_states, self.a: batch_actions, self.reward: batch_rewards, self.next_state: batch_next_state})
    # store_transition saves a transition in the buffer
    def store_transition(self, s, a, r, s_):
        trans = np.hstack((s,a,[r],s_))
        # overwrite the oldest entry once the buffer is full
        index = self.pointer % memory
        self.memory[index, :] = trans
        self.pointer += 1
        # once the buffer has filled up, decay the exploration noise and train
        if self.pointer > memory:
            self.noise_variance *= 0.99995
            self.learn()
    # build_actor_network builds the actor network
    def build_actor_network(self, s, scope, trainable):
        # Actor DPG
        with tf.variable_scope(scope):
            l1 = tf.layers.dense(s, 30, activation = tf.nn.tanh, name = 'l1', trainable = trainable)
            a = tf.layers.dense(l1, self.no_of_actions, activation = tf.nn.tanh, name = 'a', trainable = trainable)
            # tanh output lies in [-1, 1]; scale it to the action range
            return tf.multiply(a, self.a_bound, name = "scaled_a")
    # build_crtic_network builds the critic network
    def build_crtic_network(self, s, a, scope, trainable):
        with tf.variable_scope(scope):
            n_l1 = 30
            w1_s = tf.get_variable('w1_s', [self.no_of_states, n_l1], trainable = trainable)
            w1_a = tf.get_variable('w1_a', [self.no_of_actions, n_l1], trainable = trainable)
            b1 = tf.get_variable('b1', [1, n_l1], trainable = trainable)
            # the first layer combines the state and the action
            net = tf.nn.tanh( tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1 )
            q = tf.layers.dense(net, 1, trainable = trainable)
            return q
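The class above packs each transition into one row of a NumPy array and slices it apart again in learn. For comparison, here is a minimal sketch of the same replay idea using a deque of tuples. The class name `ReplayBuffer` and its methods are hypothetical and not part of the article's code; note also that it samples without replacement, whereas np.random.choice as used above samples with replacement by default.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) tuples."""

    def __init__(self, capacity, seed=None):
        # With maxlen set, appending to a full deque drops the oldest transition.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Draw distinct transitions uniformly at random, then regroup by field.
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store([float(t)], [0.1 * t], -1.0, [float(t + 1)])

states, actions, rewards, next_states = buffer.sample(2)
print(len(buffer), states)
```

Only the three most recent transitions survive, which mirrors the `pointer % memory` overwrite in store_transition.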
Initialize the gym environment:
env = gym.make("Pendulum-v0")
env = env.unwrapped
env.seed(1)
Get the dimensions of the state and action spaces:
no_of_states = env.observation_space.shape[0]
no_of_actions = env.action_space.shape[0]
Also get the upper bound of the action:
a_bound = env.action_space.high
Create an object of the DDPG class:
ddpg = DDPG(no_of_actions, no_of_states, a_bound)
# for storing the total rewards
total_reward = []
# set the number of episodes
no_of_episodes = 300
Now, start training:
# for each episode
for i in range(no_of_episodes):
    # initialize the environment
    s = env.reset()
    # episodic reward
    ep_reward = 0
    for j in range(epsiode_steps):
        # only render when the render flag defined with the hyperparameters is set
        if render:
            env.render()
        # select an action with exploration noise added
        a = ddpg.choose_action(s)
        # perform the action and move to the next state s_
        s_, r, done, info = env.step(a)
        # store the transition in the experience buffer; once the buffer is full,
        # this also samples a minibatch of experience and trains the networks
        ddpg.store_transition(s, a, r, s_)
        # update current state as next state
        s = s_
        # add episodic rewards
        ep_reward += r
        if j == epsiode_steps-1:
            # store the total rewards
            total_reward.append(ep_reward)
            # print the reward obtained in this episode
            print('Episode:', i, ' Reward: %i' % int(ep_reward))
            break
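The loop above collects per-episode returns in total_reward but does not use them further. Since single-episode returns are noisy, one way to read the trend is a trailing moving average. The helper below is a hypothetical sketch, and the sample returns are made up for illustration.

```python
def moving_average(values, window):
    """Average of each trailing window; early entries use the values seen so far."""
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        averages.append(running / min(i + 1, window))
    return averages

# Made-up episode returns for illustration only.
episode_rewards = [-1500.0, -1200.0, -900.0, -600.0]
smoothed = moving_average(episode_rewards, window=2)
print(smoothed)
```

In practice the same function could be applied to total_reward after training, with a larger window.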
Packaged program
ddpg.py:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import gym


class DDPG(object):
    def __init__(self, n_states, n_actions, action_low_bound, action_high_bound, gamma=0.99,
                 actor_lr=0.002, critic_lr=0.002, tau=0.01, memory_size=10000, batch_size=32):
        self.n_states = n_states
        self.n_actions = n_actions
        self.action_low_bound = action_low_bound
        self.action_high_bound = action_high_bound
        self.gamma = gamma
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.tau = tau
        self.memory_size = memory_size
        # replay memory: one row per transition, laid out as [s, a, r, s_]
        self.memory = np.zeros([self.memory_size, self.n_states*2 + self.n_actions + 1])
        self.memory_counter = 0
        self.batch_size = batch_size
        # standard deviation of the Gaussian exploration noise and its per-step decay
        self.action_noise = 3
        self.action_noise_decay = 0.9995
        self.learning_counter = 0
        self._build_graph()
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())
    def _build_graph(self):
        self.s = tf.placeholder(tf.float32, [None, self.n_states], name='s')
        self.s_ = tf.placeholder(tf.float32, [None, self.n_states], name='s_')
        self.r = tf.placeholder(tf.float32, [None, 1], name='r')
        self.low_action = tf.constant(self.action_low_bound, dtype=tf.float32)
        self.high_action = tf.constant(self.action_high_bound, dtype=tf.float32)
        # main and target networks for the actor and the critic
        self.actor_net = self._build_actor_net(s=self.s, trainable=True, scope='actor_eval')
        self.actor_target_net = self._build_actor_net(s=self.s_, trainable=False, scope='actor_target')
        self.critic_net = self._build_critic_net(s=self.s, a=self.actor_net, trainable=True, scope='critic_eval')
        self.critic_target_net = self._build_critic_net(s=self.s_, a=self.actor_target_net, trainable=False,
                                                        scope='critic_target')
        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='actor_eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='actor_target')
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='critic_eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='critic_target')
        # soft update. Each target variable is paired with its main-network counterpart;
        # zipping the actor and critic lists together would silently drop the critic's
        # last variable because the critic has more variables than the actor
        self.soft_replace = [tf.assign(t, (1 - self.tau) * t + self.tau * e)
                             for t, e in zip(self.at_params + self.ct_params,
                                             self.ae_params + self.ce_params)]
        with tf.variable_scope('critic_loss'):
            # y = r + gamma * Q'(s', mu'(s'))
            q_target = self.r + self.gamma*self.critic_target_net
            self.critic_loss_op = tf.reduce_mean(tf.squared_difference(q_target, self.critic_net))
        with tf.variable_scope('critic_train'):
            self.critic_train_op = tf.train.AdamOptimizer(self.critic_lr).minimize(self.critic_loss_op,
                                                                                   var_list=self.ce_params)
        with tf.variable_scope('actor_loss'):
            self.actor_loss_op = -tf.reduce_mean(self.critic_net)  # maximize q
        with tf.variable_scope('actor_train'):
            self.actor_train_op = tf.train.AdamOptimizer(self.actor_lr).minimize(self.actor_loss_op,
                                                                                 var_list=self.ae_params)
    def _build_actor_net(self, s, trainable, scope):
        k_init, b_init = tf.random_normal_initializer(0., 0.1), tf.constant_initializer(0.1)
        h1_units = 64
        with tf.variable_scope(scope):
            w1 = tf.get_variable(name='w1', shape=[self.n_states, h1_units], initializer=k_init, trainable=trainable)
            b1 = tf.get_variable(name='b1', shape=[h1_units], initializer=b_init, trainable=trainable)
            h1 = tf.nn.relu(tf.matmul(s, w1) + b1)
            w2 = tf.get_variable(name='w2', shape=[h1_units, self.n_actions], initializer=k_init, trainable=trainable)
            b2 = tf.get_variable(name='b2', shape=[self.n_actions], initializer=b_init, trainable=trainable)
            actor_net = tf.matmul(h1, w2) + b2
            actor_net = tf.clip_by_value(actor_net, self.low_action, self.high_action)
            return actor_net

    def _build_critic_net(self, s, a, trainable, scope):
        k_init, b_init = tf.random_normal_initializer(0., 0.2), tf.constant_initializer(0.1)
        h1_units = 64
        with tf.variable_scope(scope):
            w1s = tf.get_variable(name='w1s', shape=[self.n_states, h1_units], initializer=k_init, trainable=trainable)
            w1a = tf.get_variable(name='w1a', shape=[self.n_actions, h1_units], initializer=k_init, trainable=trainable)
            b1 = tf.get_variable(name='b1_e', shape=[h1_units], initializer=b_init, trainable=trainable)
            h1 = tf.nn.relu(tf.matmul(s, w1s) + tf.matmul(a, w1a) + b1)
            w2 = tf.get_variable(name='w2', shape=[h1_units, 1], initializer=k_init, trainable=trainable)
            b2 = tf.get_variable(name='b2', shape=[1], initializer=b_init, trainable=trainable)
            critic_net = tf.matmul(h1, w2) + b2
            return critic_net
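    def choose_action(self, s):
        s = s[np.newaxis, :]
        # the actor outputs a deterministic action (not probabilities)
        action_probs = self.session.run(self.actor_net, feed_dict={self.s: s})
        action = action_probs[0]
        # add Gaussian exploration noise and clip to the valid action range
        action = np.clip(np.random.normal(action, self.action_noise),
                         self.action_low_bound, self.action_high_bound)
        self.action_noise *= self.action_noise_decay
        return action

    def learn(self):
        self.session.run(self.soft_replace)
        bs, ba, br, bs_ = self.sample_memory()
        # actor step: the action must come from the actor itself, so only the states are fed
        self.session.run(self.actor_train_op, feed_dict={self.s: bs})
        # critic step: feed the stored actions in place of the actor's output
        self.session.run(self.critic_train_op, feed_dict={self.s: bs, self.actor_net: ba,
                                                          self.r: br, self.s_: bs_})
        self.learning_counter += 1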
    def store_memory(self, s, a, r, s_):
        transition = np.hstack((s, a, r, s_))
        # overwrite the oldest entry once the memory is full
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition
        self.memory_counter += 1

    def sample_memory(self):
        assert self.memory_counter >= self.batch_size
        # sample only from the rows that have been filled so far
        if self.memory_counter <= self.memory_size:
            index = np.random.choice(self.memory_counter, self.batch_size)
        else:
            index = np.random.choice(self.memory_size, self.batch_size)
        batch_memory = self.memory[index, :]
        bs = batch_memory[:, :self.n_states]
        ba = batch_memory[:, self.n_states: self.n_states+self.n_actions]
        br = batch_memory[:, self.n_states+self.n_actions]
        bs_ = batch_memory[:, -self.n_states:]
        br = br[:, np.newaxis]
        return bs, ba, br, bs_
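In the packaged class, the exploration noise starts at 3 and is multiplied by 0.9995 on every choose_action call. A quick calculation shows how long exploration stays significant. The helper `steps_until` and the threshold of 0.1 are assumptions chosen for illustration.

```python
import math

def steps_until(initial, decay, threshold):
    """Smallest n with initial * decay**n <= threshold."""
    return math.ceil(math.log(threshold / initial) / math.log(decay))

# Values taken from the class above: action_noise = 3, action_noise_decay = 0.9995.
n = steps_until(initial=3.0, decay=0.9995, threshold=0.1)
print(n)
```

With these settings the noise scale falls below 0.1 after roughly 6,800 calls, so a slower decay or a larger initial value would be needed if longer exploration is wanted.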
run.py
import gym
import matplotlib.pyplot as plt
from ddpg import DDPG
GAME = 'Pendulum-v0'
MAX_STEP = 100
EPISODES = 3000
env = gym.make(GAME)
env = env.unwrapped
env.seed(1)
n_states = env.observation_space.shape[0]
n_actions = env.action_space.shape[0]
action_low_bound = env.action_space.low
action_high_bound = env.action_space.high
agent = DDPG(n_states, n_actions, action_low_bound, action_high_bound)
def run():
    plt.ion()
    total_r = 0
    avg_ep_r_hist = []
    for episode in range(EPISODES):
        ep_step = 0
        ep_r = 0
        s = env.reset()
        while True:
            a = agent.choose_action(s)
            s_, r, done, info = env.step(a)
            agent.store_memory(s, a, r, s_)
            ep_r += r
            total_r += r
            ep_step += 1
            # start learning as soon as one full batch is available
            if agent.memory_counter >= agent.batch_size:
                agent.learn()
            if ep_step >= MAX_STEP:
                break
            s = s_
        if episode >= 10:
            # running average of the reward per episode
            avg_ep_r = total_r/(episode+1)
            avg_ep_r_hist.append(avg_ep_r)
            # avg_ep_r only exists from episode 10 on, so report inside this branch
            if episode % 20 == 0:
                print('Episode %d Avg Reward/Ep %s' % (episode, avg_ep_r))
                plt.cla()
                plt.plot(avg_ep_r_hist)
                plt.pause(0.0001)
    plt.ioff()
    plt.show()


if __name__ == '__main__':
    run()