What do the three A's in A3C (Asynchronous Advantage Actor-Critic) stand for?
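Asynchronous: A conventional DQN uses a single network to represent a single agent, and that agent interacts with only one environment. A3C improves learning efficiency and robustness through asynchronous, multi-threaded training. It builds one global network and several Worker threads; each Worker has its own copy of the network parameters and learns by interacting with its own copy of the environment. Each Worker periodically and asynchronously pushes the gradients it computes to the global network, which both speeds up training and avoids the problem of highly correlated samples (every Worker has the same task, but the details of each environment differ).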
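Actor-Critic: At its core A3C is still an Actor-Critic method, combining the strengths of value-based methods (such as Q-learning) and policy-based methods (such as Policy Gradient). A single network estimates both the value function V(s) (how good the current state is) and the policy π(s) (a probability distribution over actions); the two differ only in the final fully connected output layers.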
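Advantage: To speed up training, A3C reinforces or penalizes particular actions when it updates the weights. It uses advantage estimates so that the agent learns how much better an action's return was than expected. The advantage function can be written as:
Advantage: A = Q(s,a) - V(s)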
Because A3C does not estimate Q values directly, the discounted return (R) is used as an estimate of Q(s,a):
Advantage Estimate: A = R - V(s)
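As a small worked example of the advantage estimate A = R - V(s), the sketch below computes discounted returns for a three-step rollout and subtracts the critic's value estimates. The rewards, values, and discount factor are made-up numbers, and `discounted_returns` is a hypothetical helper written for this illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Compute R_t = r_t + gamma * R_{t+1}, starting from bootstrap_value."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0]             # made-up rewards r_t
values = np.array([0.5, 0.6, 0.9])    # made-up critic outputs V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
advantages = returns - values         # A = R - V(s)

print(returns)      # [0.25 0.5  1.  ]
print(advantages)
```

A positive advantage means the action turned out better than the critic expected, so its probability is pushed up; a negative one pushes it down.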
Implementation details
An A3C implementation consists mainly of a global network and the Workers:
AC_Network — This class contains all the Tensorflow ops to create the networks themselves.
Worker — This class contains a copy of AC_Network, an environment class, as well as all the logic for interacting with the environment, and updating the global network.
High-level code for establishing the Worker instances and running them in parallel.
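To implement A3C, first build the global network. It uses a CNN to process spatial information, then an LSTM to handle temporal dependencies, and finally attaches the value and policy output layers: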
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib.slim)
import tensorflow.contrib.slim as slim

class AC_Network():
    def __init__(self,s_size,a_size,scope,trainer):
        with tf.variable_scope(scope):
            #Input and visual encoding layers
            self.inputs = tf.placeholder(shape=[None,s_size],dtype=tf.float32)
            self.imageIn = tf.reshape(self.inputs,shape=[-1,84,84,1])
            # The CNN processes spatial information
            self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                inputs=self.imageIn,num_outputs=16,
                kernel_size=[8,8],stride=[4,4],padding='VALID')
            self.conv2 = slim.conv2d(activation_fn=tf.nn.elu,
                inputs=self.conv1,num_outputs=32,
                kernel_size=[4,4],stride=[2,2],padding='VALID')
            hidden = slim.fully_connected(slim.flatten(self.conv2),256,activation_fn=tf.nn.elu)
            # The LSTM handles temporal dependencies
            lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(256,state_is_tuple=True)
            c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
            h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
            self.state_init = [c_init, h_init]
            c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
            h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
            self.state_in = (c_in, h_in)
            # Treat the batch of frames as one sequence (batch size 1)
            rnn_in = tf.expand_dims(hidden, [0])
            step_size = tf.shape(self.imageIn)[:1]
            state_in = tf.nn.rnn_cell.LSTMStateTuple(c_in, h_in)
            lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
                lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size,
                time_major=False)
            lstm_c, lstm_h = lstm_state
            self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
            rnn_out = tf.reshape(lstm_outputs, [-1, 256])
            # Output layers for the policy and the value
            # (normalized_columns_initializer is a helper defined elsewhere)
            self.policy = slim.fully_connected(rnn_out,a_size,
                activation_fn=tf.nn.softmax,
                weights_initializer=normalized_columns_initializer(0.01),
                biases_initializer=None)
            self.value = slim.fully_connected(rnn_out,1,
                activation_fn=None,
                weights_initializer=normalized_columns_initializer(1.0),
                biases_initializer=None)
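To see what the two convolution layers do to an 84x84 frame, the sketch below works out the spatial sizes by hand. It assumes the usual output-size rule for 'VALID' padding, floor((size - kernel) / stride) + 1; `valid_conv_output` is a hypothetical helper for this illustration, not part of TensorFlow.

```python
def valid_conv_output(size, kernel, stride):
    """Spatial output size of a convolution with 'VALID' (no) padding."""
    return (size - kernel) // stride + 1

# conv1: 8x8 kernel, stride 4, 16 filters
after_conv1 = valid_conv_output(84, kernel=8, stride=4)
# conv2: 4x4 kernel, stride 2, 32 filters
after_conv2 = valid_conv_output(after_conv1, kernel=4, stride=2)
# slim.flatten turns the final feature map into one vector per frame
flattened = after_conv2 * after_conv2 * 32

print(after_conv1, after_conv2, flattened)  # 20 9 2592
```

Under that assumption, the fully connected layer maps 2592 features per frame down to the 256 units fed to the LSTM.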
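Once the global network exists, create the Workers, each with its own parameters and environment. Because every Worker runs on its own CPU thread, it is best not to create more Workers than there are CPU threads.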
import multiprocessing
import threading

# s_size, a_size, trainer, saver, model_path, load_model, max_episode_length
# and gamma are assumed to be defined earlier in the script.
with tf.device("/cpu:0"):
    # Generate global network
    master_network = AC_Network(s_size,a_size,'global',None)
    # Set workers to number of available CPU threads
    num_workers = multiprocessing.cpu_count()
    workers = []
    # Create worker classes
    for i in range(num_workers):
        workers.append(Worker(DoomGame(),i,s_size,a_size,trainer,saver,model_path))

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    if load_model == True:
        print('Loading Model...')
        ckpt = tf.train.get_checkpoint_state(model_path)
        saver.restore(sess,ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())
    # This is where the asynchronous magic happens.
    # Start the "work" process for each worker in a separate thread.
    worker_threads = []
    for worker in workers:
        # Bind the current worker as a default argument so that every thread
        # runs its own worker, not whichever one the loop variable points to later.
        worker_work = lambda worker=worker: worker.work(max_episode_length,gamma,master_network,sess,coord)
        t = threading.Thread(target=(worker_work))
        t.start()
        worker_threads.append(t)
    coord.join(worker_threads)
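With the global network and the Workers in place, we can turn to asynchronous updates. In TensorFlow, this starts with ops that copy the global network's parameters into a Worker's local network.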
# Copies one set of variables to another.
# Used to set worker network parameters to those of global network.
def update_target_graph(from_scope,to_scope):
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, from_scope)
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, to_scope)
    op_holder = []
    for from_var,to_var in zip(from_vars,to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

class Worker():
    def __init__(self,game,name,s_size,a_size,trainer,saver,model_path):
        ....
        ....
        ....
        #Create the local copy of the network and the tensorflow op to copy global parameters to local network
        self.local_AC = AC_Network(s_size,a_size,self.name,trainer)
        self.update_local_ops = update_target_graph('global',self.name)
Each Worker interacts with its own environment and collects experience, which is stored as a list of six-element entries (observation, action, reward, next observation, done, value).
import numpy as np
from time import sleep

class Worker():
    ....
    ....
    ....
    def work(self,max_episode_length,gamma,global_AC,sess,coord):
        episode_count = 0
        total_step_count = 0
        print("Starting worker " + str(self.number))
        with sess.as_default(), sess.graph.as_default():
            while not coord.should_stop():
                # Start every episode from the latest global parameters.
                sess.run(self.update_local_ops)
                episode_buffer = []
                episode_values = []
                episode_frames = []
                episode_reward = 0
                episode_step_count = 0
                d = False
                self.env.new_episode()
                s = self.env.get_state().screen_buffer
                episode_frames.append(s)
                s = process_frame(s)
                rnn_state = self.local_AC.state_init
                while self.env.is_episode_finished() == False:
                    #Take an action using probabilities from policy network output.
                    a_dist,v,rnn_state = sess.run([self.local_AC.policy,self.local_AC.value,self.local_AC.state_out],
                        feed_dict={self.local_AC.inputs:[s],
                        self.local_AC.state_in[0]:rnn_state[0],
                        self.local_AC.state_in[1]:rnn_state[1]})
                    # Sample the action index directly, so that actions with
                    # equal probabilities are not collapsed onto the first one.
                    a = np.random.choice(len(a_dist[0]),p=a_dist[0])
                    r = self.env.make_action(self.actions[a]) / 100.0
                    d = self.env.is_episode_finished()
                    if d == False:
                        s1 = self.env.get_state().screen_buffer
                        episode_frames.append(s1)
                        s1 = process_frame(s1)
                    else:
                        s1 = s
                    episode_buffer.append([s,a,r,s1,d,v[0,0]])
                    episode_values.append(v[0,0])
                    episode_reward += r
                    s = s1
                    total_step_count += 1
                    episode_step_count += 1
                    #Specific to VizDoom. We sleep the game for a specific time.
                    if self.sleep_time>0:
                        sleep(self.sleep_time)
                    # If the episode hasn't ended, but the experience buffer is full, then we
                    # make an update step using that experience rollout.
                    if len(episode_buffer) == 30 and d != True and episode_step_count != max_episode_length - 1:
                        # Since we don't know what the true final return is, we "bootstrap" from our current
                        # value estimation.
                        v1 = sess.run(self.local_AC.value,
                            feed_dict={self.local_AC.inputs:[s],
                            self.local_AC.state_in[0]:rnn_state[0],
                            self.local_AC.state_in[1]:rnn_state[1]})[0,0]
                        v_l,p_l,e_l,g_n,v_n = self.train(global_AC,episode_buffer,sess,gamma,v1)
                        episode_buffer = []
                        sess.run(self.update_local_ops)
                    if d == True:
                        break
                self.episode_rewards.append(episode_reward)
                self.episode_lengths.append(episode_step_count)
                self.episode_mean_values.append(np.mean(episode_values))
                # Update the network using the experience buffer at the end of the episode.
                v_l,p_l,e_l,g_n,v_n = self.train(global_AC,episode_buffer,sess,gamma,0.0)
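The action-selection step can be tried in isolation. The sketch below samples an action index straight from a policy output of shape (1, num_actions); `sample_action` and the probabilities are hypothetical and only serve as an illustration. Sampling the index directly matters when two actions share the same probability: drawing a probability value first and then searching for its position would always resolve such a tie to the first matching index.

```python
import numpy as np

def sample_action(a_dist, rng):
    """Draw an action index from a (1, num_actions) array of probabilities."""
    probs = a_dist[0]
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
a_dist = np.array([[0.25, 0.25, 0.5]])   # made-up policy output with a tie

samples = [sample_action(a_dist, rng) for _ in range(10000)]
frequencies = np.bincount(samples, minlength=3) / len(samples)
print(frequencies)   # empirical frequencies, close to a_dist[0]
```

Both tied actions are selected about a quarter of the time each, as the policy intends.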
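Once a Worker has collected enough experience, A3C uses it to compute discounted returns and advantages, which in turn give the value loss and the policy loss. A3C also computes the entropy (H) of the policy, which is used to encourage exploration: if the policy assigns similar probabilities to all actions, the entropy is high; if it strongly favors one action, the entropy is low.
Value Loss: L = Σ(R - V(s))²
Policy Loss: L = -log(π(a|s)) * A(s) - β*H(π)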
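A minimal NumPy sketch of these quantities, assuming a batch of two steps with four actions each. All numbers are made up and `a3c_losses` is a hypothetical helper; it follows the two formulas above and is not the TensorFlow graph itself.

```python
import numpy as np

def a3c_losses(policy, actions, returns, values, beta=0.01):
    """Return (value_loss, policy_loss, entropy) for one batch of steps."""
    advantages = returns - values                         # A = R - V(s)
    chosen = policy[np.arange(len(actions)), actions]     # probability of each action taken
    entropy = -np.sum(policy * np.log(policy))            # H(pi), summed over the batch
    value_loss = np.sum((returns - values) ** 2)
    policy_loss = -np.sum(np.log(chosen) * advantages) - beta * entropy
    return value_loss, policy_loss, entropy

policy = np.array([[0.25, 0.25, 0.25, 0.25],    # undecided policy: high entropy
                   [0.97, 0.01, 0.01, 0.01]])   # confident policy: low entropy
actions = np.array([2, 0])
returns = np.array([1.0, 0.5])
values = np.array([0.5, 0.5])

value_loss, policy_loss, entropy = a3c_losses(policy, actions, returns, values)
row_entropy = -np.sum(policy * np.log(policy), axis=1)
print(row_entropy)   # the uniform row has entropy ln(4); the peaked row is much lower
```

Subtracting β*H from the loss rewards policies that keep some probability on every action, which is how the entropy term encourages exploration.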
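The Worker uses the value loss and the policy loss to compute gradients with respect to its own network parameters. To keep overly large updates from destabilizing the policy, the gradients are clipped. The Worker then applies these gradients to the parameters of the global network.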
class AC_Network():
    def __init__(self,s_size,a_size,scope,trainer):
        ....
        ....
        ....
        if scope != 'global':
            self.actions = tf.placeholder(shape=[None],dtype=tf.int32)
            self.actions_onehot = tf.one_hot(self.actions,a_size,dtype=tf.float32)
            self.target_v = tf.placeholder(shape=[None],dtype=tf.float32)
            self.advantages = tf.placeholder(shape=[None],dtype=tf.float32)
            self.responsible_outputs = tf.reduce_sum(self.policy * self.actions_onehot, [1])
            #Loss functions
            self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
            self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
            self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
            self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
            #Get gradients from local network using local losses
            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.loss,local_vars)
            self.var_norms = tf.global_norm(local_vars)
            grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
            #Apply local gradients to global network
            global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
            self.apply_grads = trainer.apply_gradients(zip(grads,global_vars))
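The clipping step can be illustrated outside TensorFlow. The sketch below is a NumPy re-implementation of clipping by global norm, written to mirror the documented behavior of `tf.clip_by_global_norm` (every gradient is scaled by clip_norm / max(global_norm, clip_norm)); the gradient values are made up.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients together so that their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm: sqrt(9 + 16 + 144) = 13

clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
print(norm)      # 13.0
print(clipped)   # every gradient is halved

unchanged, _ = clip_by_global_norm(grads, clip_norm=40.0)   # norm below threshold: no change
```

Because all gradients share one scale factor, the direction of the overall update is preserved and only its length is limited.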
class Worker():
    ....
    ....
    ....
    def train(self,global_AC,rollout,sess,gamma,bootstrap_value):
        rollout = np.array(rollout)
        observations = rollout[:,0]
        actions = rollout[:,1]
        rewards = rollout[:,2]
        next_observations = rollout[:,3]
        values = rollout[:,5]
        # Here we take the rewards and values from the rollout, and use them to
        # generate the advantage and discounted returns.
        # The advantage function uses "Generalized Advantage Estimation"
        self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
        discounted_rewards = discount(self.rewards_plus,gamma)[:-1]
        self.value_plus = np.asarray(values.tolist() + [bootstrap_value])
        advantages = rewards + gamma * self.value_plus[1:] - self.value_plus[:-1]
        advantages = discount(advantages,gamma)
        # Update the global network using gradients from loss
        # Generate network statistics to periodically save
        rnn_state = self.local_AC.state_init
        feed_dict = {self.local_AC.target_v:discounted_rewards,
            self.local_AC.inputs:np.vstack(observations),
            self.local_AC.actions:actions,
            self.local_AC.advantages:advantages,
            self.local_AC.state_in[0]:rnn_state[0],
            self.local_AC.state_in[1]:rnn_state[1]}
        v_l,p_l,e_l,g_n,v_n,_ = sess.run([self.local_AC.value_loss,
            self.local_AC.policy_loss,
            self.local_AC.entropy,
            self.local_AC.grad_norms,
            self.local_AC.var_norms,
            self.local_AC.apply_grads],
            feed_dict=feed_dict)
        return v_l / len(rollout),p_l / len(rollout),e_l / len(rollout), g_n,v_n
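The `discount` helper is not shown in the code above. The sketch below uses a plain-loop stand-in for it (an assumption, not the original helper) to reproduce the return and advantage computation from `train` on made-up numbers. It also checks a useful identity: discounting the one-step TD residuals with the same gamma yields exactly the bootstrapped discounted return minus V(s), which is the advantage estimate A = R - V(s).

```python
import numpy as np

def discount(x, gamma):
    """y[t] = x[t] + gamma * y[t+1]; hypothetical stand-in for the helper used in train()."""
    y = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        y[t] = running
    return y

rewards = np.array([0.0, 1.0, 0.0, 2.0])    # made-up rollout rewards
values = np.array([0.3, 0.8, 0.4, 1.5])     # made-up critic outputs V(s_t)
bootstrap_value, gamma = 0.7, 0.9           # value estimate of the state after the rollout

rewards_plus = np.append(rewards, bootstrap_value)
discounted_rewards = discount(rewards_plus, gamma)[:-1]

value_plus = np.append(values, bootstrap_value)
td_residuals = rewards + gamma * value_plus[1:] - value_plus[:-1]
advantages = discount(td_residuals, gamma)

print(discounted_rewards)
print(advantages)
print(np.allclose(advantages, discounted_rewards - values))   # True
```

The bootstrap value stands in for all rewards after the rollout is cut off, so a rollout that ends with the episode passes 0.0 instead.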
After the global network has been updated, each Worker resets its own parameters to those of the global network and starts the next round of training from there.
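The whole cycle (sync from global, compute gradients locally, apply them to the global parameters) can be sketched without TensorFlow. The toy below is a simplification under stated assumptions: the "network" is a three-element parameter vector, the loss is the made-up quadratic 0.5 * ||w - target||², and a lock guards the shared vector to keep the example easy to reason about. `run_toy_a3c` and every value in it are hypothetical.

```python
import threading
import numpy as np

def run_toy_a3c(num_workers=4, steps=200, lr=0.05):
    """Several threads pull global parameters, compute a gradient locally,
    and push that gradient back to the shared global parameters."""
    target = np.array([1.0, -2.0, 3.0])    # made-up optimum of the toy loss
    global_params = np.zeros(3)
    lock = threading.Lock()

    def worker():
        for _ in range(steps):
            with lock:
                local_params = global_params.copy()         # sync local copy from global
            grad = local_params - target                    # gradient of 0.5 * ||w - target||^2
            with lock:
                global_params[:] = global_params - lr * grad    # apply gradient to global

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_params, target

final_params, target = run_toy_a3c()
print(final_params)   # ends up close to target
```

Between a worker's sync and its update, other workers may already have changed the global parameters, so gradients can be slightly stale. That staleness is the price of asynchronous training, and with a small learning rate the toy still converges.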
References: