Today we move on to A3C.

Problem addressed

The main weakness of the earlier Actor-Critic algorithm is slow convergence. A3C introduces three improvements to address it:

1. Recall the earlier parameter update formula:
$$\theta = \theta + \alpha \nabla_{\theta}\log \pi_{\theta}(S_t, A)\,\delta$$
A3C uses a different form, the advantage function:
$$A(S, A, w, \beta) = Q(S, A, w, \alpha, \beta) - V(S, w, \alpha)$$
The parameter update then takes the form:
$$\theta = \theta + \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)\,A(S, A, w, \beta)$$
This is the update for the Actor. A further small optimization is to add the entropy of the policy $\pi$, with coefficient $c$, to the Actor-Critic policy loss, so that compared with Actor-Critic the gradient update of the policy parameters becomes:
$$\theta = \theta + \alpha \nabla_{\theta}\log \pi_{\theta}(s_t, a_t)\,A(S, t) + c\nabla_{\theta}H(\pi(S_t, \theta))$$
Note that the entropy term enters with a positive coefficient, so the gradient step pushes the policy's entropy up. This regularization keeps the action distribution from collapsing prematurely into a near-deterministic policy and thus encourages exploration.

That completes the update of the Actor network parameters.
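To make the Actor update concrete, here is a minimal PyTorch-style sketch of the policy loss with advantage weighting and an entropy bonus. It is only an illustration under assumed inputs: `probs`, `actions`, `advantages` and the coefficient `c` are names chosen here, not taken from the original text.

```python
import torch

def actor_loss(probs, actions, advantages, c=0.01):
    """Policy-gradient loss with an entropy bonus (illustrative sketch).

    probs:      (batch, n_actions) action probabilities from pi_theta
    actions:    (batch,) long tensor of the actions actually taken
    advantages: (batch,) advantage estimates A(s, a), treated as constants
    c:          entropy coefficient
    """
    log_probs = torch.log(probs + 1e-8)
    # log pi(a_t | s_t) for the chosen actions
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # entropy H(pi(.|s_t)) of the current policy at each state
    entropy = -(probs * log_probs).sum(dim=1)
    # ascending  log pi * A + c * H  is the same as descending its negative
    return -(chosen * advantages.detach() + c * entropy).mean()
```

Minimizing this loss with any optimizer is equivalent to the gradient-ascent update written above; the full listing at the end of the post does the same thing inside `return_a_loss`.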
Let us take a closer look at the advantage function:
$$A(S, A, t) = Q(S, A) - V(S)$$
Since
$$Q(S, A) = R + \gamma V(S')$$
it follows that
$$A(S, A, t) = R + \gamma V(S') - V(S)$$
Of course, this is the single-step sampling estimate. A3C uses n-step sampling instead:
$$A(S, t) = R_t + \gamma R_{t+1} + \cdots + \gamma^{n-1} R_{t+n-1} + \gamma^n V(S') - V(S)$$
This is what is used to update the Critic parameters.
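As a small illustration of how the n-step quantities can be computed from a rollout, here is a sketch in plain Python; `rewards`, `values`, `bootstrap_value` and `gamma` are assumed inputs, not names from the original text.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.9):
    """Compute n-step targets R_t + gamma*R_{t+1} + ... + gamma^n * V(S')
    and the corresponding advantages (target - V(S)). Illustrative sketch.

    rewards:         rewards r_t, ..., r_{t+n-1} collected in the rollout
    values:          Critic estimates V(s_t), ..., V(s_{t+n-1})
    bootstrap_value: V(S') for the state after the rollout (0 if terminal)
    """
    targets = []
    ret = bootstrap_value
    for r in reversed(rewards):       # accumulate rewards backwards in time
        ret = r + gamma * ret
        targets.append(ret)
    targets.reverse()
    advantages = [t - v for t, v in zip(targets, values)]
    return targets, advantages
```

The full code at the end of the post builds its `buffer_v_target` in exactly this backward fashion.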
2. Asynchronous training framework
The Global Network at the top of the figure is the shared component described above: a common neural network model that combines the functionality of the Actor network and the Critic network. Below it are n worker threads. Each thread holds a network with the same structure as the common network, interacts with the environment independently to collect experience, and runs without interfering with the other threads.

After a thread has collected a certain amount of data from its environment, it computes the gradient of its local network's loss. It does not apply this gradient to its own network; instead, it applies it to the common network. In other words, the n threads independently use their accumulated gradients to update the shared model parameters. Every so often, a thread copies the common network's parameters back into its own network to guide its subsequent interaction with the environment. A minimal sketch of one such update step is given below.
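Here is a minimal sketch of one worker update step in PyTorch, mirroring the `ensure_shared_grads` pattern used in the full listing below; the function and variable names (`push_and_pull`, `local_net`, `global_net`) are illustrative, not part of the original code.

```python
import torch

def push_and_pull(local_net, global_net, global_optimizer, loss):
    """One asynchronous update by a worker (illustrative sketch).

    The worker backpropagates its own loss, hands the resulting gradients
    to the shared global network, lets the global optimizer take a step,
    and finally re-synchronizes its local parameters from the global ones.
    """
    global_optimizer.zero_grad()
    loss.backward()                                      # gradients live in local_net
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad                               # push local gradients to the global net
    global_optimizer.step()                              # update the shared parameters
    local_net.load_state_dict(global_net.state_dict())   # pull the updated parameters back
```

Here `global_optimizer` is assumed to have been constructed over `global_net.parameters()`.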
3. Network architecture optimization

Here the two networks are merged into a single network that outputs $v$ and $\pi$ at the same time.
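A minimal sketch of what such a merged network could look like; the layer sizes and names are arbitrary choices for illustration (note that the full listing at the end of this post still keeps two separate networks).

```python
import torch
from torch import nn
from torch.nn import functional as F

class SharedACNet(nn.Module):
    """One network with a shared trunk and two heads: pi(a|s) and V(s)."""
    def __init__(self, n_states, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Linear(n_states, hidden)         # shared feature layer
        self.policy_head = nn.Linear(hidden, n_actions)  # outputs pi
        self.value_head = nn.Linear(hidden, 1)           # outputs v

    def forward(self, s):
        h = F.relu(self.trunk(s))
        pi = F.softmax(self.policy_head(h), dim=-1)      # action probabilities
        v = self.value_head(h)                           # state value
        return pi, v
```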
Algorithm flow

Here we summarize the A3C algorithm. Since A3C is asynchronous and multi-threaded, we give the flow for an arbitrary single thread.
Input: the global A3C network structure with parameters $\theta$, $w$; this thread's A3C network with the same structure and parameters $\theta'$, $w'$; the globally shared iteration counter $T$; the global maximum number of iterations $T_{max}$; the maximum rollout length per iteration inside a thread $T_{local}$; the state feature dimension $n$; the action set $A$; step sizes $\alpha$, $\beta$; the entropy coefficient $c$; the discount factor $\gamma$; the exploration rate $\epsilon$.
Output: the parameters $\theta$, $w$ of the global A3C network.
1. Set the time step $t = 1$.
2. Reset the accumulated gradients of the Actor and the Critic: $d\theta \leftarrow 0$, $dw \leftarrow 0$.
3. Synchronize the parameters from the global A3C network to this thread's network: $\theta' = \theta$, $w' = w$.
4. $t_{start} = t$; initialize the state $s_t$.
5. Select an action $a_t$ according to the policy $\pi(a_t|s_t;\theta)$.
6. Execute action $a_t$, obtaining the reward $r_t$ and the new state $s_{t+1}$.
7. $t \leftarrow t + 1$, $T \leftarrow T + 1$.
8. If $s_t$ is a terminal state, or $t - t_{start} == T_{local}$, go to step 9; otherwise return to step 5.
9. Compute the value estimate $Q(s,t)$ for the last position $s_t$ of the rollout:
$$Q(s,t) = \begin{cases} 0 & \text{if } s_t \text{ is terminal} \\ V(s_t, w') & \text{otherwise} \end{cases}$$
10. For $i \in (t-1, t-2, \ldots, t_{start})$:
1) Compute $Q(s,i)$ for each time step: $Q(s,i) = r_i + \gamma Q(s,i+1)$
2) Accumulate the Actor's local gradient update:
$$d\theta \leftarrow d\theta + \nabla_{\theta'}\log \pi_{\theta'}(s_i, a_i)\big(Q(s,i) - V(s_i, w')\big) + c\nabla_{\theta'}H(\pi(s_i, \theta'))$$
3) Accumulate the Critic's local gradient update:
$$dw \leftarrow dw + \frac{\partial \big(Q(s,i) - V(s_i, w')\big)^2}{\partial w'}$$
11. Update the global model parameters:
$$\theta = \theta - \alpha\, d\theta, \qquad w = w - \beta\, dw$$
12. If $T > T_{max}$, the algorithm terminates and outputs the global A3C network parameters $\theta$, $w$; otherwise return to step 3.

That is the A3C algorithm flow for a single thread.
Code

The code below may still contain errors; corrections are welcome. It is a PyTorch implementation based on the TensorFlow code referenced in the header.

Run it from the command line with python a3c.py
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 11 10:54:30 2019
@author: asus
"""
#######################################################################
# Copyright (C) #
# 2016 - 2019 Pinard Liu(liujianping-ok@163.com) #
# https://www.cnblogs.com/pinard #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
## reference from MorvanZhou's A3C code on Github, minor update:##
##https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/10_A3C/A3C_discrete_action.py ##
## https://www.cnblogs.com/pinard/p/10334127.html ##
## 强化学习(十五) A3C ##
import numpy as np
import gym
import matplotlib.pyplot as plt
import torch
from torch import nn
from torch.nn import functional as F
import torch.multiprocessing as mp

GAME = 'CartPole-v0'
OUTPUT_GRAPH = True
LOG_DIR = './log'
N_WORKERS = 3
MAX_GLOBAL_EP = 3000
GLOBAL_NET_SCOPE = 'Global_Net'
UPDATE_GLOBAL_ITER = 100
GAMMA = 0.9
ENTROPY_BETA = 0.001
LR_A = 0.001    # learning rate for actor
LR_C = 0.001    # learning rate for critic
GLOBAL_RUNNING_R = []
GLOBAL_EP = 0
STEP = 3000     # step limitation in an episode
TEST = 10       # number of evaluation episodes run after training

env = gym.make(GAME)
N_S = env.observation_space.shape[0]
N_A = env.action_space.n


def ensure_shared_grads(model, shared_model):
    # Point the global (shared) network's gradients at the gradients computed
    # locally, so that the shared optimizer applies the worker's update.
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        shared_param._grad = param.grad
class Actor(torch.nn.Module):
    def __init__(self):
        super(Actor, self).__init__()
        self.actor_fc1 = nn.Linear(N_S, 200)
        self.actor_fc1.weight.data.normal_(0, 0.6)
        self.actor_fc2 = nn.Linear(200, N_A)
        self.actor_fc2.weight.data.normal_(0, 0.6)

    def forward(self, s):
        l_a = F.relu6(self.actor_fc1(s))
        a_prob = F.softmax(self.actor_fc2(l_a), dim=1)  # explicit dim: softmax over actions
        return a_prob


class Critic(torch.nn.Module):
    def __init__(self):
        super(Critic, self).__init__()
        self.critic_fc1 = nn.Linear(N_S, 100)
        self.critic_fc1.weight.data.normal_(0, 0.6)
        self.critic_fc2 = nn.Linear(100, 1)
        self.critic_fc2.weight.data.normal_(0, 0.6)

    def forward(self, s):
        l_c = F.relu6(self.critic_fc1(s))
        v = self.critic_fc2(l_c)
        return v
class ACNet(torch.nn.Module):
    def __init__(self):
        super(ACNet, self).__init__()
        self.actor = Actor()
        self.critic = Critic()
        self.actor_optimizer = torch.optim.Adam(params=self.actor.parameters(), lr=LR_A)
        self.critic_optimizer = torch.optim.Adam(params=self.critic.parameters(), lr=LR_C)

    def return_td(self, s, v_target):
        self.v = self.critic(s)
        td = v_target - self.v
        return td.detach()

    def return_c_loss(self, s, v_target):
        self.v = self.critic(s)
        self.td = v_target - self.v
        self.c_loss = (self.td ** 2).mean()
        return self.c_loss

    def return_a_loss(self, s, td, a_his):
        self.a_prob = self.actor(s)
        a_his = a_his.unsqueeze(1)
        one_hot = torch.zeros(a_his.shape[0], N_A).scatter_(1, a_his, 1)
        log_prob = torch.sum(torch.log(self.a_prob + 1e-5) * one_hot, dim=1, keepdim=True)
        exp_v = log_prob * td
        entropy = -torch.sum(self.a_prob * torch.log(self.a_prob + 1e-5), dim=1, keepdim=True)
        self.exp_v = ENTROPY_BETA * entropy + exp_v
        self.a_loss = (-self.exp_v).mean()
        return self.a_loss

    def choose_action(self, s):  # run by a local
        prob_weights = self.actor(torch.FloatTensor(s[np.newaxis, :]))
        action = np.random.choice(range(prob_weights.shape[1]),
                                  p=prob_weights.detach().numpy().ravel())  # select action w.r.t. the action probabilities
        return action
def work(name, AC, lock):
    env = gym.make(GAME).unwrapped
    global GLOBAL_RUNNING_R, GLOBAL_EP
    total_step = 1
    buffer_s, buffer_a, buffer_r = [], [], []
    model = ACNet()
    actor_optimizer = AC.actor_optimizer
    critic_optimizer = AC.critic_optimizer
    while GLOBAL_EP < MAX_GLOBAL_EP:
        lock.acquire()
        model.load_state_dict(AC.state_dict())  # pull the global parameters into the local net
        s = env.reset()
        ep_r = 0
        while True:
            # env.render()  # uncomment to visualize a worker
            a = model.choose_action(s)
            s_, r, done, info = env.step(a)
            if done: r = -5  # reward shaping: penalize episode termination
            ep_r += r
            buffer_s.append(s)
            buffer_a.append(a)
            buffer_r.append(r)
            if total_step % UPDATE_GLOBAL_ITER == 0 or done:  # update global and assign to local net
                if done:
                    v_s_ = 0  # terminal state: bootstrap value is 0
                else:
                    # if not terminal, bootstrap with the Critic's estimate V(s')
                    v_s_ = model.critic(torch.FloatTensor(s_[np.newaxis, :]))[0, 0].item()
                buffer_v_target = []
                # build the n-step targets by accumulating rewards backwards
                for r in buffer_r[::-1]:  # reverse buffer r
                    v_s_ = r + GAMMA * v_s_
                    buffer_v_target.append(v_s_)
                buffer_v_target.reverse()
                buffer_s, buffer_a, buffer_v_target = torch.FloatTensor(buffer_s), torch.LongTensor(buffer_a), torch.FloatTensor(buffer_v_target).unsqueeze(1)  # (N, 1) to match the Critic output
                td_error = model.return_td(buffer_s, buffer_v_target)
                c_loss = model.return_c_loss(buffer_s, buffer_v_target)
                critic_optimizer.zero_grad()
                c_loss.backward()
                ensure_shared_grads(model, AC)
                critic_optimizer.step()
                a_loss = model.return_a_loss(buffer_s, td_error, buffer_a)
                actor_optimizer.zero_grad()
                a_loss.backward()
                ensure_shared_grads(model, AC)
                actor_optimizer.step()
                buffer_s, buffer_a, buffer_r = [], [], []
            s = s_
            total_step += 1
            if done:
                if len(GLOBAL_RUNNING_R) == 0:  # record running episode reward
                    GLOBAL_RUNNING_R.append(ep_r)
                else:
                    GLOBAL_RUNNING_R.append(0.99 * GLOBAL_RUNNING_R[-1] + 0.01 * ep_r)
                print("name: %s" % name, "| Ep_r: %i" % GLOBAL_RUNNING_R[-1])
                GLOBAL_EP += 1
                break
        lock.release()
if __name__ == "__main__":
GLOBAL_AC = ACNet() # we only need its params
GLOBAL_AC.share_memory()
workers = []
lock = mp.Lock()
# Create worker
for i in range(N_WORKERS):
p = mp.Process(target=work, args=(i,GLOBAL_AC,lock,))
p.start()
workers.append(p)
for p in workers:
p.join()
total_reward = 0
for i in range(TEST):
state = env.reset()
for j in range(STEP):
# env.render()
action = GLOBAL_AC.choose_action(state) # direct action for test
state, reward, done, _ = env.step(action)
total_reward += reward
if done:
break
ave_reward = total_reward / TEST
print('episode: ', GLOBAL_EP, 'Evaluation Average Reward:', ave_reward)
plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R)
plt.xlabel('step')
plt.ylabel('Total moving reward')
plt.show()
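One known limitation of the listing above is that module-level globals such as GLOBAL_EP and GLOBAL_RUNNING_R are not shared between worker processes, so each process only sees its own copy. Below is a minimal sketch of one possible workaround using a multiprocessing Value; the names `global_ep` and `bump_episode` are hypothetical and not part of the original code.

```python
import torch.multiprocessing as mp

# Hypothetical adjustment: a counter that really is shared across processes.
global_ep = mp.Value('i', 0)   # 'i' = signed int, initial value 0

def bump_episode(counter):
    # Increment the shared counter under its own lock and return the new value.
    with counter.get_lock():
        counter.value += 1
        return counter.value
```

Workers could receive such a Value as an argument and use it in place of the GLOBAL_EP global.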