Previously, we studied policy-based reinforcement learning methods and used the Monte Carlo policy gradient algorithm REINFORCE.
Problems:
1. The algorithm requires a complete state sequence (a full episode) before it can update.
2. Because the policy function alone is iteratively updated, it is not easy to converge.
Improvements
In the previous article (policy-based reinforcement learning methods), we made the following improvements: we used neural networks for the two approximations below.
First, the approximation of the policy (a minimal sketch of both approximating networks follows the formulas):
\pi_{\theta}(s,a) = P(a|s,\theta) \approx \pi(a|s)
Second, the approximation of the value function.
State-value function approximation:
\hat{v}(s, w) \approx v_{\pi}(s)
Action-value function approximation:
\hat{q}(s,a,w) \approx q_{\pi}(s,a)
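To make the two approximations above concrete, here is a minimal PyTorch sketch (class and parameter names are illustrative, not taken from the implementation below) of a softmax policy network for π_θ(a|s) and a state-value network for v̂(s,w); the Actor and Critic classes in the code section follow the same pattern.

import torch
import torch.nn.functional as F

class PolicyNet(torch.nn.Module):
    """Illustrative pi_theta(a|s): outputs action probabilities."""
    def __init__(self, state_dim, action_dim, hidden=20):
        super().__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden)
        self.fc2 = torch.nn.Linear(hidden, action_dim)

    def forward(self, s):
        # softmax over action logits gives P(a|s, theta)
        return F.softmax(self.fc2(F.relu(self.fc1(s))), dim=-1)

class ValueNet(torch.nn.Module):
    """Illustrative v_hat(s, w): outputs one scalar value per state."""
    def __init__(self, state_dim, hidden=20):
        super().__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden)
        self.fc2 = torch.nn.Linear(hidden, 1)

    def forward(self, s):
        return self.fc2(F.relu(self.fc1(s)))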
Recall that our policy gradient update formula is:
\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t,a_t) v_t
In the Monte Carlo policy gradient algorithm REINFORCE, we computed v_t directly from complete episodes. This time we instead use a neural network to estimate v_t: a Q network serves as the Critic. The input of this Q network is the state, and its output is the value of each action, or the value of the optimal action.
π_θ(s_t, a_t) belongs to the Actor, which selects the actions.
The overall idea: the Critic uses the Q network to compute a state's optimal value v_t; the Actor uses this v_t to iteratively update the policy parameters θ and thereby select actions, obtaining a reward and a new state; the Critic then uses the reward and the new state to update the Q network parameters w; from then on, the Critic uses the new parameters w to help the Actor compute the optimal value v_t. One round of this interaction is sketched below.
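The following is a minimal runnable sketch of one such round; the linear layers, the random state, and the constant reward are toy stand-ins, not part of the implementation given later.

import torch
import torch.nn.functional as F

# Toy one-step Actor-Critic update on random data (illustrative only)
state_dim, action_dim, gamma, alpha = 4, 2, 0.95, 0.01
actor = torch.nn.Linear(state_dim, action_dim)   # stand-in for the policy network
critic = torch.nn.Linear(state_dim, 1)           # stand-in for the Q/V network
actor_opt = torch.optim.Adam(actor.parameters(), lr=alpha)
critic_opt = torch.optim.Adam(critic.parameters(), lr=alpha)

s = torch.randn(1, state_dim)                    # current state (toy)
probs = F.softmax(actor(s), dim=1)               # pi_theta(a|s)
a = torch.multinomial(probs, 1)                  # Actor selects an action
s_next, r = torch.randn(1, state_dim), 1.0       # pretend environment feedback

# Critic: TD target r + gamma * V(s'); update w by the squared TD error
td_error = r + gamma * critic(s_next).detach() - critic(s)
critic_opt.zero_grad(); (td_error ** 2).mean().backward(); critic_opt.step()

# Actor: theta <- theta + alpha * grad log pi(a|s) * delta
log_pi = torch.log(probs.gather(1, a))
actor_loss = -(log_pi * td_error.detach()).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()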
Besides this, the Critic's evaluation point can take several other forms.
a) Based on the state value: this is the evaluation point we used in the previous section; the Actor's policy parameter update formula is then:
\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t,a_t) V(s,w)
b) Based on the action value: in DQN we generally use the action-value function Q for value evaluation; the Actor's policy parameter update formula is then:
\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t,a_t) Q(s,a,w)
c) Based on the TD error: recall the TD error, whose expression is
\delta(t) = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
or
\delta(t) = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)
and the Actor's policy parameter update formula is then:
\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t,a_t) \delta(t)
Let us look more closely at this δ(t). The Critic here is still a Q network: it takes a state (or a state-action pair) as input and outputs a score. The difference lies in how its parameters are updated: instead of the earlier targets, it uses
R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)
or
R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
That is, we feed in the two consecutive states, obtain their values, and use the squared TD error built from this expression as the loss function to update the Q network.
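For intuition, here is a quick numeric check of the TD error and the resulting squared loss, with arbitrarily chosen values:

# Numeric TD-error example with arbitrary values
R_next, gamma = 1.0, 0.95        # reward R_{t+1} and discount factor
V_s, V_s_next = 2.0, 2.5         # Critic outputs V(S_t) and V(S_{t+1})
delta = R_next + gamma * V_s_next - V_s  # 1.0 + 0.95*2.5 - 2.0 = 1.375
loss = delta ** 2                # squared TD error used as the Critic loss
print(delta, loss)               # 1.375 1.890625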
Algorithm flow
Algorithm input: number of iterations T, state feature dimension n, action set A, step sizes α and β, discount factor γ, exploration rate ϵ, and the Critic and Actor network structures.
Output: the Actor network parameters θ and the Critic network parameters w.
1. Randomly initialize the value Q of every state and action.
2. for i from 1 to T, iterate:
a) Initialize S as the first state of the current episode and obtain its feature vector ϕ(S).
b) Use ϕ(S) as the input of the Actor network, output an action A, and execute A to obtain the next state S′ and the reward R.
c) Use ϕ(S) and ϕ(S′) as inputs of the Critic network to obtain the value outputs V(S) and V(S′).
d) Compute the TD error:
\delta = R + \gamma V(S') - V(S)
e) Use the mean squared error loss
\sum \big(R + \gamma V(S') - V(S,w)\big)^2
to update the Critic network parameters w by gradient descent.
f) Update the Actor network parameters θ:
\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(S_t,A) \delta
For the Actor's score function ∇_θ log π_θ(S_t, A), either a softmax or a Gaussian score function can be chosen.
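One detail worth noting before the code: for a softmax policy, PyTorch's F.cross_entropy applied to the pre-softmax logits returns exactly −log π_θ(a|s), which is why the Actor below builds its loss from cross_entropy instead of computing the log-probability by hand. A quick check with toy tensors:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 0.5]])  # pre-softmax Actor output (toy)
action = torch.tensor([1])                 # chosen action index
neg_log_pi = F.cross_entropy(logits, action, reduction='none')
manual = -torch.log(F.softmax(logits, dim=1)[0, action])
print(neg_log_pi, manual)                  # both equal -log pi(a|s)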
Code
This is a PyTorch version, adapted from a TensorFlow implementation.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec  9 14:12:17 2019
@author: asus
"""
# Disadvantage: difficult to converge
import gym
import torch
import numpy as np
import torch.nn.functional as F

# Hyper parameters
GAMMA = 0.95          # discount factor
LEARNING_RATE = 0.01  # learning rate for both the Actor and the Critic


class Actor(torch.nn.Module):
    def __init__(self, env):
        super(Actor, self).__init__()
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.fc1 = torch.nn.Linear(self.state_dim, 20)
        self.fc1.weight.data.normal_(0, 0.1)
        self.fc2 = torch.nn.Linear(20, self.action_dim)
        self.fc2.weight.data.normal_(0, 0.1)  # fixed: the original re-initialized fc1 here

    def create_softmax_network(self, state_input):
        # pi_theta(a|s): softmax over the action logits
        h_layer = F.relu(self.fc1(state_input))
        softmax_input = self.fc2(h_layer)
        all_act_prob = F.softmax(softmax_input, dim=1)
        return all_act_prob

    def forward(self, state_input, acts, td_error):
        # cross_entropy on the logits is exactly -log pi_theta(a|s), so
        # minimizing mean(-log pi * delta) performs the gradient ascent step
        # theta <- theta + alpha * grad log pi * delta
        # (fixed: the original returned the negated loss, flipping the sign)
        h_layer = F.relu(self.fc1(state_input))
        softmax_input = self.fc2(h_layer)
        neg_log_prob = F.cross_entropy(softmax_input, acts, reduction='none')
        loss = (neg_log_prob * td_error.view(-1)).mean()
        return loss


# Leftover constants from the DQN version; largely unused in the training below.
EPSILON = 0.01            # final value of epsilon
REPLAY_SIZE = 10000       # experience replay buffer size
BATCH_SIZE = 32           # size of minibatch
REPLACE_TARGET_FREQ = 10  # frequency to update target Q network


class Critic(torch.nn.Module):
    def __init__(self, env):
        super(Critic, self).__init__()
        # init some parameters
        self.time_step = 0
        self.epsilon = EPSILON
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.fc1 = torch.nn.Linear(self.state_dim, 20)
        self.fc1.weight.data.normal_(0, 0.1)
        self.fc2 = torch.nn.Linear(20, 1)
        self.fc2.weight.data.normal_(0, 0.1)

    def create_q_value(self, state_input):
        # one scalar value estimate per state
        state_input = torch.as_tensor(state_input, dtype=torch.float32)
        h_layer = F.relu(self.fc1(state_input))
        Q_value = self.fc2(h_layer)
        return Q_value

    def return_td_error(self, state_input, next_value, reward):
        Q_value = self.create_q_value(state_input)
        # reshape reward to (N, 1) so it broadcasts against the (N, 1) values
        td_error = reward.view(-1, 1) + GAMMA * next_value - Q_value
        return td_error

    def forward(self, state_input, next_value, reward):
        # Critic loss: sum of squared TD errors
        td_error = self.return_td_error(state_input, next_value, reward)
        loss = (td_error ** 2).sum()
        return loss


class Policy_Gradient():
    def __init__(self, actor, critic):
        self.actor = actor
        self.critic = critic
        # fixed: the original passed EPSILON as the Critic's learning rate
        self.critic_optimizer = torch.optim.Adam(params=self.critic.parameters(), lr=LEARNING_RATE)
        self.actor_optimizer = torch.optim.Adam(params=self.actor.parameters(), lr=LEARNING_RATE)
        self.ep_obs, self.ep_as, self.ep_rs, self.ep_next = [], [], [], []

    def train_critic(self):
        self.ep_obs = torch.FloatTensor(np.array(self.ep_obs))
        self.ep_rs = torch.FloatTensor(self.ep_rs)
        self.ep_next = torch.FloatTensor(np.array(self.ep_next))
        # bootstrap target V(S'); detached so gradients only flow through V(S)
        v_ = self.critic.create_q_value(self.ep_next).detach()
        td_error = self.critic.return_td_error(self.ep_obs, v_, self.ep_rs)
        critic_loss = self.critic(self.ep_obs, v_, self.ep_rs)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        return td_error

    def train_actor(self, td_error):
        # td_error arrives detached, so it acts as a fixed weight on -log pi(a|s)
        actor_loss = self.actor(self.ep_obs,
                                torch.LongTensor(self.ep_as),
                                td_error)
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        self.ep_obs, self.ep_as, self.ep_rs, self.ep_next = [], [], [], []

    def store_transition(self, s, a, r, n):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)
        self.ep_next.append(n)

    def choose_action(self, observation):
        # sample an action from the current softmax policy
        prob_weights = self.actor.create_softmax_network(torch.FloatTensor(observation[np.newaxis, :]))
        action = np.random.choice(range(prob_weights.shape[1]),
                                  p=prob_weights.detach().numpy().ravel())
        return action


# Hyper parameters
ENV_NAME = 'CartPole-v0'
EPISODE = 3000  # episode limitation
STEP = 3000     # step limitation in an episode
TEST = 10       # number of evaluation runs every 100 episodes


def main():
    # initialize the OpenAI Gym env and the Actor-Critic agent
    env = gym.make(ENV_NAME)
    actor = Actor(env)
    critic = Critic(env)
    a_c_train = Policy_Gradient(actor, critic)
    for episode in range(EPISODE):
        # initialize task
        state = env.reset()
        # Train
        for step in range(STEP):
            action = a_c_train.choose_action(state)  # sample an action for training
            next_state, reward, done, _ = env.step(action)
            a_c_train.store_transition(state, action, reward, next_state)
            state = next_state
            if done:
                td_error = a_c_train.train_critic()       # gradient = grad[(r + gamma*V(s') - V(s))^2]
                a_c_train.train_actor(td_error.detach())  # true_gradient = grad[logPi(s,a) * td_error]
                break
        # Test every 100 episodes
        if episode % 100 == 0:
            total_reward = 0
            for i in range(TEST):
                state = env.reset()
                for j in range(STEP):
                    # env.render()
                    action = a_c_train.choose_action(state)  # direct action for test
                    state, reward, done, _ = env.step(action)
                    total_reward += reward
                    if done:
                        break
            ave_reward = total_reward / TEST
            print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)


if __name__ == '__main__':
    main()
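A usage note: this script targets the classic gym API (gym < 0.26), where env.reset() returns just the observation and env.step() returns four values. On newer gym/gymnasium releases the calls unpack differently; a minimal sketch of the adaptation, assuming gymnasium is installed:

import gymnasium as gym

env = gym.make('CartPole-v1')
state, _ = env.reset()                # reset returns (obs, info)
action = env.action_space.sample()    # placeholder action for illustration
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated        # combine both end conditions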