Contents
1. Defining the Algorithm
The most basic policy gradient method is the REINFORCE algorithm, also known as Monte-Carlo Policy Gradient. The objective of policy optimization is to maximize the expected return of trajectories sampled from the policy:

J(θ) = E_{τ∼π_θ}[ Σ_t γ^t r_t ]
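The gradient of this objective is estimated with the score-function (log-derivative) trick, ∇_θ J(θ) = E[∇_θ log π_θ(a|s) · G]. A minimal sketch (with a toy policy parameterized directly by a probability p, not part of the original code) checks the score function for a Bernoulli policy against PyTorch autograd:

```python
import torch
from torch.distributions import Bernoulli

# A toy Bernoulli "policy" parameterized directly by a probability p
p = torch.tensor([0.7], requires_grad=True)
dist = Bernoulli(probs=p)

# Score function for action a=1: log_prob = log p, so d/dp log p = 1/p
a = torch.tensor([1.0])
log_prob = dist.log_prob(a)
log_prob.backward()
print(p.grad)  # ≈ 1 / 0.7 ≈ 1.4286
```

Scaling this gradient by the return G and averaging over sampled trajectories yields the REINFORCE update.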
1.1. Designing the Policy Function
1.2. Designing the Model
As mentioned earlier, although this demo uses a discrete action space, the action dimension is 2, so the policy can be represented with a particularly simple distribution: the Bernoulli distribution (a special case of the binomial distribution). A Bernoulli distribution takes a single probability as its parameter and samples actions from it; the sampled action can only be 0 or 1, like the two faces of a coin flip. In this setting, the policy model is an MLP that takes the state as input, produces the action value in its second-to-last layer, and applies an activation function in the final layer to output the probability of the action. If you are unsure what an activation function does, revisit the relevant deep learning basics; in short, it introduces non-linearity into the neural network. Since the output must be a probability, it has to lie between 0 and 1, and the sigmoid function satisfies exactly this requirement. A reference implementation follows.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PGNet(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim=128):
        """Policy network: a fully connected MLP.
        input_dim: number of input features, i.e. the state dimension of the environment
        output_dim: action dimension of the output
        """
        super(PGNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # input layer
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # hidden layer
        self.fc3 = nn.Linear(hidden_dim, output_dim)  # output layer

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))  # squash to (0, 1): probability of taking action 1
        return x
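As a quick sanity check, the sigmoid output is always a valid probability and a Bernoulli sample from it is always 0 or 1. The sketch below rebuilds the same layers with nn.Sequential so it runs on its own, and feeds a made-up 4-dimensional state (as in CartPole):

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

# Same architecture as PGNet above, rebuilt as nn.Sequential to be self-contained
net = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)
state = torch.randn(1, 4)  # a fake CartPole-like state
prob = net(state)          # probability of action 1, guaranteed to be in (0, 1)
action = Bernoulli(prob).sample()
print(prob.item(), action.item())  # action is either 0.0 or 1.0
```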
1.3. Designing the Update Function
As mentioned earlier, our optimization objective, i.e. the loss function of the policy gradient algorithm, is:

L(θ) = - Σ_t G_t log π_θ(a_t | s_t)

where G_t is the discounted return from time step t; minimizing this loss performs gradient ascent on the expected return.
import torch
from torch.distributions import Bernoulli
import numpy as np

class PolicyGradient:
    def __init__(self, model, memory, cfg):
        self.gamma = cfg['gamma']
        self.device = torch.device(cfg['device'])
        self.memory = memory
        self.policy_net = model.to(self.device)
        self.optimizer = torch.optim.RMSprop(self.policy_net.parameters(), lr=cfg['lr'])

    def sample_action(self, state):
        """Sample an action stochastically from the Bernoulli policy (used during training)."""
        state = torch.from_numpy(state).float().to(self.device)
        probs = self.policy_net(state)
        m = Bernoulli(probs)         # Bernoulli distribution
        action = m.sample()
        action = int(action.item())  # convert to a scalar
        return action

    def predict_action(self, state):
        """Pick the greedy action (used during evaluation)."""
        state = torch.from_numpy(state).float().to(self.device)
        probs = self.policy_net(state)
        action = int((probs > 0.5).item())  # take action 1 when its probability exceeds 0.5
        return action
    def update(self):
        state_pool, action_pool, reward_pool = self.memory.sample()
        state_pool, action_pool, reward_pool = list(state_pool), list(action_pool), list(reward_pool)
        # Discount rewards backwards in time; a reward of 0 marks an episode boundary,
        # so the running return is reset there
        running_add = 0
        for i in reversed(range(len(reward_pool))):
            if reward_pool[i] == 0:
                running_add = 0
            else:
                running_add = running_add * self.gamma + reward_pool[i]
            reward_pool[i] = running_add
        # Normalize the returns (a small epsilon guards against division by zero)
        reward_mean = np.mean(reward_pool)
        reward_std = np.std(reward_pool)
        for i in range(len(reward_pool)):
            reward_pool[i] = (reward_pool[i] - reward_mean) / (reward_std + 1e-8)
        # Gradient descent
        self.optimizer.zero_grad()
        for i in range(len(reward_pool)):
            state = torch.from_numpy(state_pool[i]).float().to(self.device)
            action = torch.FloatTensor([action_pool[i]]).to(self.device)
            reward = reward_pool[i]
            probs = self.policy_net(state)
            m = Bernoulli(probs)
            loss = -m.log_prob(action) * reward  # negative score function x return
            loss.backward()  # gradients accumulate across the loop; one optimizer step at the end
        self.optimizer.step()
        self.memory.clear()
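The return computation inside update() can be isolated into a small helper for illustration. This is a sketch: unlike update() above, it does not treat a zero reward as an episode boundary, and the function name is our own:

```python
import numpy as np

def discounted_normalized_returns(rewards, gamma):
    """Discount rewards backwards in time, then standardize to zero mean and unit variance."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running_add = 0.0
    for i in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[i]
        returns[i] = running_add
    return (returns - returns.mean()) / (returns.std() + 1e-8)

# With gamma = 1 the raw returns of [1, 1, 1] are [3, 2, 1] before normalization
print(discounted_normalized_returns([1.0, 1.0, 1.0], gamma=1.0))
```

Standardizing the returns does not change the direction of the expected gradient, but it reduces its variance, which noticeably stabilizes REINFORCE training.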