DPPO Principles
DPPO takes the same approach as A3C: several workers interact with the environment at the same time, and a single global PPO network is updated from their experience.
In A3C, each worker has to roll out data and compute the gradients itself, and only then update the global network. This is because actor-critic (AC) is an on-policy algorithm: the policy that generates the data and the policy being updated must be the same network, so we cannot simply hand the workers' data to the global network for it to compute gradients from.
PPO, however, can update from data collected by a slightly older policy, so in DPPO the workers only need to deliver data to the global network, and the global network learns from that data directly.
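The reason this is safe is PPO's clipped surrogate objective: every sample is weighted by the probability ratio between the current policy and the (slightly older) policy that collected it, and the ratio is clipped, which limits how far one batch of worker data can push the policy. A rough NumPy sketch of the idea is shown below (function and argument names are illustrative; the actual TensorFlow version appears in the code later in this section):

import numpy as np

def clipped_surrogate(pi_prob, oldpi_prob, advantage, epsilon=0.2):
    # importance ratio between the current policy and the older data-collecting policy
    ratio = pi_prob / (oldpi_prob + 1e-5)
    # clipping keeps the effective ratio inside [1 - epsilon, 1 + epsilon]
    clipped = np.clip(ratio, 1. - epsilon, 1. + epsilon)
    # PPO maximizes the smaller (more pessimistic) of the two surrogate terms
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))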
Thread Communication
Like A3C, DPPO uses multiple threads to learn. The global network is the PPO itself, and its job is learning; the workers interact with the environment under the "guidance" of the global network and store the data they collect for the global network to learn from.
You may notice that these two jobs cannot run at the same time: while the global network is learning, the workers have to wait for it to finish before they can work, and while the workers are working, the global network has to wait for them to deliver data.
We can picture each thread as a water pipe: a thread running through its code is like water flowing through the pipe. If some threads must wait for other threads to finish before continuing, those pipes need a valve. Installing a valve is simple: at the spot where the thread should pause, add a single line of code that puts it into a waiting state (in Python this is event.wait()).
When a thread reaches that line it pauses; we say the thread is "blocked". How does it get past the valve and continue? Imagine a worker who operates the valve: he watches a "signal light" and decides from the light whether to let the thread continue.
Two operations change the signal light: event.clear() sets the signal to False, meaning the thread may not proceed; event.set() sets it to True, meaning the thread may continue.
With two events we can therefore coordinate two groups of threads. Suppose we have two groups of threads, A and B, and we require that while A runs, B waits, and when A finishes, B runs and A waits. We can do it as follows (a minimal runnable sketch is given after the list):
- First, at the appropriate places in threads A and B we insert the "valves": A_event.wait() and B_event.wait() respectively. By default both signal lights are False, i.e. the blocking state.
- Next we set A_event's signal to True and B_event's to False: A_event.set(), B_event.clear(). After the threads start, thread A keeps running, while thread B blocks at B_event.wait() and waits for A to finish.
- At the end of A's run we want B to start, so A calls A_event.clear(), B_event.set(). Thread B, waiting at B_event.wait(), sees the signal turn True and continues. A does not stop right away, though: A is usually a loop, so it starts its next iteration and runs until it reaches A_event.wait(); because the signal there is now False, A goes no further and waits for the signal to change.
- Likewise, at the end of B we set the signals back: A_event.set(), B_event.clear(). In this way threads A and B keep alternating.
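Below is a minimal, self-contained sketch of this A/B alternation using Python's threading.Event. It only illustrates the signalling pattern described above and is separate from the DPPO code that follows:

import threading

A_event, B_event = threading.Event(), threading.Event()
A_event.set()    # A may run first
B_event.clear()  # B has to wait

def thread_A():
    for step in range(3):
        A_event.wait()                  # the "valve": blocks while A's signal is False
        print('A working, step', step)
        A_event.clear()                 # stop A at its next iteration ...
        B_event.set()                   # ... and hand control to B

def thread_B():
    for step in range(3):
        B_event.wait()                  # blocks until A hands over control
        print('B working, step', step)
        B_event.clear()
        A_event.set()

ta = threading.Thread(target=thread_A)
tb = threading.Thread(target=thread_B)
ta.start(); tb.start()
ta.join(); tb.join()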
DPPO Code Implementation
"""
A simple version of OpenAI's Proximal Policy Optimization (PPO). [https://arxiv.org/abs/1707.06347]
Distributing workers in parallel to collect data, then stop worker's roll-out and train PPO on collected data.
Restart workers once PPO is updated.
The global PPO updating rule is adopted from DeepMind's paper (DPPO):
Emergence of Locomotion Behaviours in Rich Environments (Google Deepmind): [https://arxiv.org/abs/1707.02286]
View more on my tutorial website: https://morvanzhou.github.io/tutorials
Dependencies:
tensorflow 1.14+ or 2.x (this code uses the tf.compat.v1 API)
gym >= 0.26 (new reset/step API and render_mode argument)
"""
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import matplotlib.pyplot as plt
import gym, threading, queue
EP_MAX = 1000
EP_LEN = 200
N_WORKER = 8 # parallel workers
GAMMA = 0.9 # reward discount factor
A_LR = 0.0001 # learning rate for actor
C_LR = 0.0002 # learning rate for critic
MIN_BATCH_SIZE = 64 # minimum batch size for updating PPO
UPDATE_STEP = 10 # loop update operation n-steps
EPSILON = 0.2 # for clipping surrogate objective
GAME = 'Pendulum-v1'
S_DIM, A_DIM = 3, 1 # state and action dimension
class PPO(object):
def __init__(self):
self.sess = tf.Session()
self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')
# critic
l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu)
self.v = tf.layers.dense(l1, 1)
self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')
self.advantage = self.tfdc_r - self.v
self.closs = tf.reduce_mean(tf.square(self.advantage))
self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)
# actor
pi, pi_params = self._build_anet('pi', trainable=True)
oldpi, oldpi_params = self._build_anet('oldpi', trainable=False)
self.sample_op = tf.squeeze(pi.sample(1), axis=0) # operation of choosing action
self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]
self.tfa = tf.placeholder(tf.float32, [None, A_DIM], 'action')
self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')
# ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa))
ratio = pi.prob(self.tfa) / (oldpi.prob(self.tfa) + 1e-5)
surr = ratio * self.tfadv # surrogate loss
self.aloss = -tf.reduce_mean(tf.minimum( # clipped surrogate objective
surr,
tf.clip_by_value(ratio, 1. - EPSILON, 1. + EPSILON) * self.tfadv))
self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(self.aloss)
self.sess.run(tf.global_variables_initializer())
def update(self):
global GLOBAL_UPDATE_COUNTER
while not COORD.should_stop():
if GLOBAL_EP < EP_MAX:
UPDATE_EVENT.wait() # wait until get batch of data
self.sess.run(self.update_oldpi_op) # copy pi to old pi
data = [QUEUE.get() for _ in range(QUEUE.qsize())] # collect data from all workers
data = np.vstack(data)
s, a, r = data[:, :S_DIM], data[:, S_DIM: S_DIM + A_DIM], data[:, -1:]
adv = self.sess.run(self.advantage, {self.tfs: s, self.tfdc_r: r})
# update actor and critic in an update loop
[self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(UPDATE_STEP)]
[self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(UPDATE_STEP)]
UPDATE_EVENT.clear() # updating finished
GLOBAL_UPDATE_COUNTER = 0 # reset counter
ROLLING_EVENT.set() # set roll-out available
def _build_anet(self, name, trainable):
with tf.variable_scope(name):
l1 = tf.layers.dense(self.tfs, 200, tf.nn.relu, trainable=trainable)
mu = 2 * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable)
sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable)
norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)
params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
return norm_dist, params
def choose_action(self, s):
s = s[np.newaxis, :]
a = self.sess.run(self.sample_op, {self.tfs: s})[0]
return np.clip(a, -2, 2)
def get_v(self, s):
if s.ndim < 2: s = s[np.newaxis, :]
return self.sess.run(self.v, {self.tfs: s})[0, 0]
class Worker(object):
def __init__(self, wid):
self.wid = wid
if self.wid == 0:
self.env = gym.make(GAME, render_mode='human')
else:
self.env = gym.make(GAME)
self.ppo = GLOBAL_PPO
def work(self):
global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER
while not COORD.should_stop():
s = self.env.reset()[0]
ep_r = 0
buffer_s, buffer_a, buffer_r = [], [], []
for t in range(EP_LEN):
if not ROLLING_EVENT.is_set(): # while global PPO is updating
ROLLING_EVENT.wait() # wait until PPO is updated
buffer_s, buffer_a, buffer_r = [], [], [] # clear history buffer, use new policy to collect data
a = self.ppo.choose_action(s)
s_, r, done, _, _ = self.env.step(a)
buffer_s.append(s)
buffer_a.append(a)
buffer_r.append((r + 8) / 8) # normalize reward; found to be useful
s = s_
ep_r += r
GLOBAL_UPDATE_COUNTER += 1 # count to minimum batch size, no need to wait other workers
if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:
v_s_ = self.ppo.get_v(s_)
discounted_r = [] # compute discounted reward
for r in buffer_r[::-1]:
v_s_ = r + GAMMA * v_s_
discounted_r.append(v_s_)
discounted_r.reverse()
bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis]
buffer_s, buffer_a, buffer_r = [], [], []
QUEUE.put(np.hstack((bs, ba, br))) # put data in the queue
if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE:
ROLLING_EVENT.clear() # stop collecting data
UPDATE_EVENT.set() # global PPO update
if GLOBAL_EP >= EP_MAX: # stop training
COORD.request_stop()
break
# record reward changes, plot later
if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r)
else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1]*0.9+ep_r*0.1)
GLOBAL_EP += 1
print('{0:.1f}%'.format(GLOBAL_EP/EP_MAX*100), '|W%i' % self.wid, '|Ep_r: %.2f' % ep_r,)
if __name__ == '__main__':
GLOBAL_PPO = PPO()
UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event()
UPDATE_EVENT.clear() # not update now
ROLLING_EVENT.set() # start to roll out
workers = [Worker(wid=i) for i in range(N_WORKER)]
GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0
GLOBAL_RUNNING_R = []
COORD = tf.train.Coordinator()
QUEUE = queue.Queue() # workers putting data in this queue
threads = []
for worker in workers[::-1]: # worker threads
if worker.wid != 0:
t = threading.Thread(target=worker.work, args=())
t.start() # training
threads.append(t)
elif worker.wid == 0:
# add a PPO updating thread
threads.append(threading.Thread(target=GLOBAL_PPO.update, ))
threads[-1].start()
worker.work()
COORD.join(threads)
# plot reward change and test
plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)
plt.xlabel('Episode'); plt.ylabel('Moving reward'); plt.ion(); plt.show()
env = gym.make('Pendulum-v1', render_mode='human')
while True:
s = env.reset()[0]
for t in range(300):
env.render()
s = env.step(GLOBAL_PPO.choose_action(s))[0]