Policy Gradient (策略梯度)

In this chapter we focus on policy-based methods.
Problems addressed:
The reinforcement learning algorithms covered so far are value-based: they choose actions by comparing Q values and V values. This approach has several drawbacks.
First, it handles continuous action spaces poorly.
Second, it copes poorly with restricted (incompletely observed) state information: two states that are genuinely different in the real environment can end up with the same feature description after modeling.
Third, it cannot produce stochastic policies. The optimal policy obtained by a value-based method is usually deterministic, because it always picks the single action with the largest value, yet for some problems the optimal policy is stochastic; such problems cannot be solved by value-based learning.

Theoretical Derivation

Next we approximate the policy $\pi$ and the action-value function $q$ with parameterized functions (for example, neural networks):
$$\hat{q}(s,a,w) \approx q_{\pi}(s,a)$$
$$\pi_{\theta}(s,a) = P(a|s,\theta) \approx \pi(a|s)$$
A common parameterization of $\pi_{\theta}(s,a)$ is a softmax over linear features, where $\phi(s,a)$ is a feature vector of the state-action pair:
$$\pi_{\theta}(s,a) = \frac{e^{\phi(s,a)^T\theta}}{\sum_b e^{\phi(s,b)^T\theta}}$$
Given an objective $J(\theta)$ to optimize, its gradient with respect to $\theta$ can be written as
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\nabla_{\theta}\log \pi_{\theta}(s,a)\, Q_{\pi}(s,a)\big]$$
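As a quick sanity check on the score function $\nabla_{\theta}\log \pi_{\theta}(s,a)$, here is a minimal numpy sketch (the feature vectors, dimensions, and seed are made up for illustration) that builds the softmax-over-linear-features policy above for a single state and verifies the identity $\nabla_{\theta}\log\pi_{\theta}(s,a) = \phi(s,a) - \sum_b \pi_{\theta}(s,b)\phi(s,b)$ against a finite-difference estimate.

import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 3, 4
theta = rng.normal(size=dim)
phi = rng.normal(size=(n_actions, dim))          # phi[a] plays the role of phi(s, a) for one fixed state s

def policy(th):
    logits = phi @ th
    logits -= logits.max()                       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

a = 1
score = phi[a] - policy(theta) @ phi             # analytic gradient of log pi(a|s) w.r.t. theta

eps = 1e-6
fd = np.array([(np.log(policy(theta + eps * e)[a]) - np.log(policy(theta - eps * e)[a])) / (2 * eps)
               for e in np.eye(dim)])
print(np.max(np.abs(score - fd)))                # tiny (~1e-9): analytic score matches finite differences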

There are two common choices for this objective.
The first is the average reward:
$$\rho(\pi) = \lim_{n\to\infty}\frac{1}{n}\mathbb{E}\{r_1+r_2+\dots+r_n \mid \pi\}=\sum_{s}d^{\pi}(s)V^{\pi}(s)=\sum_s d^{\pi}(s)\sum_a\pi(s,a)R_s^a$$
$$d^{\pi}(s) = \lim_{t\to\infty}\Pr\{s_t=s \mid s_0,\pi\}$$
Here $d^{\pi}(s)$ is the stationary distribution of the Markov chain induced by policy $\pi$.
In this setting the state-action value is defined as the differential return:
$$Q^{\pi}(s,a)=\sum_{t=1}^{\infty}\mathbb{E}\{r_t-\rho(\pi) \mid s_0=s,a_0=a,\pi\},\quad \forall s\in S,\ a\in A$$

The second choice is the expected return from a designated start state:
$$\rho(\pi) =V^{\pi}(s_0)=\mathbb{E}\Big\{\sum_{t=1}^{\infty}\gamma^{t-1}r_t \,\Big|\, s_0,\pi\Big\}$$
with the $Q$ value defined as
$$Q^{\pi}(s,a) = \mathbb{E}\Big\{\sum_{k=1}^{\infty}\gamma^{k-1}r_{t+k} \,\Big|\, s_t=s,a_t=a,\pi\Big\}$$
where $\gamma\in[0,1]$, and in this case we define
$$d^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^t\Pr\{s_t=s \mid s_0,\pi\}$$
Whichever form is used, the gradient ends up taking the same shape:
$$\frac{\partial \rho}{\partial \theta}=\sum_s d^{\pi}(s)\sum_a\frac{\partial\pi(s,a)}{\partial\theta}Q^{\pi}(s,a)$$
Proof:
First, the derivation for the average-reward objective:
$$\frac{\partial V^{\pi}(s)}{\partial \theta}=\frac{\partial}{\partial \theta}\sum_a\pi(s,a)Q^{\pi}(s,a) \quad \forall s\in S$$
$$=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^{\pi}(s,a)\Big]$$
$$=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\big(R_s^a-\rho(\pi)+\sum_{s'}P_{ss'}^aV^{\pi}(s')\big)\Big]$$
$$=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\big(-\frac{\partial \rho}{\partial \theta}+\sum_{s'}P_{ss'}^a\frac{\partial V^{\pi}(s')}{\partial \theta}\big)\Big]$$
Therefore, rearranging for $\frac{\partial \rho}{\partial \theta}$,
$$\frac{\partial \rho}{\partial \theta}=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\sum_{s'}P_{ss'}^a\frac{\partial V^{\pi}(s')}{\partial \theta}\Big]-\frac{\partial V^{\pi}(s)}{\partial \theta}$$
Summing both sides over the stationary distribution $d^{\pi}(s)$:
$$\sum_{s}d^{\pi}(s)\frac{\partial \rho}{\partial \theta}=\sum_{s}d^{\pi}(s)\sum_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\sum_{s}d^{\pi}(s)\sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^a\frac{\partial V^{\pi}(s')}{\partial \theta}-\sum_{s}d^{\pi}(s)\frac{\partial V^{\pi}(s)}{\partial \theta}$$
Because $d^{\pi}$ is stationary, the middle term on the right equals $\sum_{s'}d^{\pi}(s')\frac{\partial V^{\pi}(s')}{\partial \theta}$ and cancels the last term, while $\sum_s d^{\pi}(s)=1$ on the left, so
$$\frac{\partial \rho}{\partial \theta}=\sum_{s}d^{\pi}(s)\sum_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$$
This completes the derivation for the average-reward objective.
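To see the average-reward identity hold numerically, the following self-contained numpy sketch (the two-state MDP, its rewards, and the tabular softmax parameters are all made up for illustration) computes the stationary distribution $d^{\pi}$, the differential values $Q^{\pi}$, the gradient predicted by the formula above, and compares it with a finite-difference estimate of $\partial\rho/\partial\theta$.

import numpy as np

rng = np.random.default_rng(0)
nS, nA = 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s'] transition probabilities
R = rng.normal(size=(nS, nA))                     # R[s, a] expected reward
theta = rng.normal(size=(nS, nA))                 # one softmax parameter per (s, a)

def pi(th):                                       # tabular softmax policy pi(a|s)
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def avg_reward_quantities(th):
    p = pi(th)
    P_pi = np.einsum('sa,sax->sx', p, P)          # state transition matrix under pi
    R_pi = (p * R).sum(axis=1)                    # expected one-step reward per state under pi
    w, vecs = np.linalg.eig(P_pi.T)               # stationary distribution solves d = P_pi^T d
    d = np.real(vecs[:, np.argmin(np.abs(w - 1))])
    d = d / d.sum()
    rho = d @ R_pi                                # average reward rho(pi)
    # differential values: (I - P_pi) V = R_pi - rho (singular system; any particular solution works)
    V = np.linalg.lstsq(np.eye(nS) - P_pi, R_pi - rho, rcond=None)[0]
    Q = R - rho + np.einsum('sax,x->sa', P, V)    # Q(s,a) = R(s,a) - rho + sum_s' P(s'|s,a) V(s')
    return p, d, rho, Q

p, d, rho, Q = avg_reward_quantities(theta)

# gradient predicted by the policy gradient theorem: sum_s d(s) sum_a dpi(a|s)/dtheta * Q(s,a)
grad = np.zeros_like(theta)
for s in range(nS):
    for b in range(nA):
        dpi = p[s] * ((np.arange(nA) == b) - p[s, b])   # d pi(a|s) / d theta[s, b] for tabular softmax
        grad[s, b] = d[s] * (dpi * Q[s]).sum()

# finite-difference estimate of d rho / d theta
eps, fd = 1e-6, np.zeros_like(theta)
for s in range(nS):
    for b in range(nA):
        e = np.zeros_like(theta); e[s, b] = eps
        fd[s, b] = (avg_reward_quantities(theta + e)[2] - avg_reward_quantities(theta - e)[2]) / (2 * eps)
print(np.max(np.abs(grad - fd)))                  # close to zero: the two gradients agree

Because the differential values are only determined up to an additive constant and $\sum_a \partial\pi(a|s)/\partial\theta = 0$, it does not matter which particular solution the least-squares step picks.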

Next, the derivation for the start-state objective:
$$\frac{\partial V^{\pi}(s)}{\partial \theta}=\frac{\partial}{\partial \theta}\sum_a\pi(s,a)Q^{\pi}(s,a) \quad \forall s\in S$$
$$=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^{\pi}(s,a)\Big]$$
$$=\sum_{a}\Big[\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\big(R_s^a+\gamma\sum_{s'}P_{ss'}^aV^{\pi}(s')\big)\Big]$$
Note that we keep unrolling $V^{\pi}(s')$ recursively in the same way, which yields
$$\frac{\partial V^{\pi}(s)}{\partial \theta}=\sum_{x}\sum_{k=0}^{\infty}\gamma^k \Pr(s \to x,k,\pi)\sum_{a}\frac{\partial \pi(x,a)}{\partial \theta}Q^{\pi}(x,a)$$
where $\Pr(s \to x,k,\pi)$ is the probability of reaching state $x$ from state $s$ in $k$ steps under policy $\pi$. Therefore
$$\frac{\partial \rho}{\partial \theta}=\frac{\partial}{\partial \theta}\mathbb{E}\Big\{\sum_{t=1}^{\infty}\gamma^{t-1}r_t \,\Big|\, s_0,\pi\Big\}=\frac{\partial}{\partial \theta}V^{\pi}(s_0)$$
$$=\sum_{s}\sum_{k=0}^{\infty}\gamma^k \Pr(s_0 \to s,k,\pi)\sum_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$$
Recalling that in this setting $d^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^t\Pr\{s_t=s \mid s_0,\pi\}$, we obtain
$$\frac{\partial \rho}{\partial \theta}=\sum_{s}d^{\pi}(s)\sum_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)$$
Both cases can therefore be written as
$$\frac{\partial \rho}{\partial \theta}=\mathbb{E}_{\pi}\Big[\sum_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^{\pi}(s,a)\Big]$$
where the expectation is over states weighted by $d^{\pi}$. Using $\frac{\partial \pi(s,a)}{\partial \theta}=\pi(s,a)\nabla_{\theta}\log\pi_{\theta}(s,a)$ to fold the action into the expectation recovers the form $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\, Q_{\pi}(s,a)]$ stated at the beginning. For the full proof, see the policy gradient theorem paper.
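The start-state (discounted) form can be checked the same way. The self-contained sketch below (again on a made-up two-state MDP) solves for $V^{\pi}$ and $Q^{\pi}$ exactly, computes the discounted visitation weights $d^{\pi}(s)=\sum_t\gamma^t\Pr\{s_t=s\mid s_0\}$, evaluates the right-hand side of the theorem, and compares it with a finite-difference estimate of $\partial V^{\pi}(s_0)/\partial\theta$.

import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, s0 = 2, 2, 0.9, 0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s']
R = rng.normal(size=(nS, nA))                    # R[s, a]
theta = rng.normal(size=(nS, nA))                # tabular softmax parameters

def pi(th):
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def solve_values(th):
    p = pi(th)
    P_pi = np.einsum('sa,sax->sx', p, P)
    R_pi = (p * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)   # V = R_pi + gamma * P_pi V
    return p, P_pi, V

p, P_pi, V = solve_values(theta)
Q = R + gamma * np.einsum('sax,x->sa', P, V)               # Q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s')

# discounted visitation weights: d = e_{s0} + gamma * P_pi^T d
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])

# gradient predicted by the theorem
grad = np.zeros_like(theta)
for s in range(nS):
    for b in range(nA):
        dpi = p[s] * ((np.arange(nA) == b) - p[s, b])      # d pi(a|s) / d theta[s, b]
        grad[s, b] = d[s] * (dpi * Q[s]).sum()

# finite-difference estimate of d V^pi(s0) / d theta
eps, fd = 1e-6, np.zeros_like(theta)
for s in range(nS):
    for b in range(nA):
        e = np.zeros_like(theta); e[s, b] = eps
        fd[s, b] = (solve_values(theta + e)[2][s0] - solve_values(theta - e)[2][s0]) / (2 * eps)
print(np.max(np.abs(grad - fd)))                           # close to zero: the theorem checks out

Note that in this form $d^{\pi}$ is a discounted visitation weighting rather than a probability distribution; its entries sum to $1/(1-\gamma)$.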

Algorithm Flow

Input: N complete Monte Carlo episodes, learning rate α
Output: the policy function's parameters θ

  1. For each Monte Carlo episode:
    a. Use the Monte Carlo method to compute the return $v_t$ at every time step t of the episode (see the sketch after this list).
    b. For every time step $t$, update the policy parameters $\theta$ by gradient ascent:
    $\theta = \theta + \alpha\nabla_{\theta}\log\pi_{\theta}(s_t,a_t)\,v_t$
  2. Return the policy function's parameters $\theta$
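A minimal sketch of step 1a, computing the Monte Carlo return $v_t$ from one episode's rewards (the reward list is made up); the learn() method in the full code below does the same thing and additionally normalizes the returns:

import numpy as np

def mc_returns(rewards, gamma=0.95):
    # v_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    v = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        v[t] = running
    return v

print(mc_returns([1.0, 1.0, 1.0]))   # [2.8525 1.95   1.    ]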

Compared with the theorem above, the algorithm substitutes the Monte Carlo return $v_t$ for $Q^{\pi}(s_t,a_t)$; $v_t$ is an unbiased estimate of it. The $\log$ term is what lets the update be implemented as a cross_entropy loss later; see the code for details.
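The log/cross-entropy connection can be checked directly in PyTorch: for a softmax policy, F.cross_entropy applied to the logits with the chosen action as the target is exactly $-\log\pi_{\theta}(a_t|s_t)$. A tiny sketch with made-up logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 0.5]])                # unnormalized action scores for one state
action = torch.tensor([1])                               # the action that was actually taken

neg_log_prob = F.cross_entropy(logits, action, reduction='none')
log_prob = torch.log(F.softmax(logits, dim=1))[0, 1]     # log pi(a=1 | s)
print(neg_log_prob.item(), -log_prob.item())             # identical values

Since optimizers minimize, the code below minimizes (neg_log_prob * vt).sum(), which is the same as gradient ascent on $\sum_t \log\pi_{\theta}(s_t,a_t)\,v_t$.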

Code

The idea is straightforward. Below is the PyTorch implementation, adapted from a TensorFlow version; it assumes the classic gym API in which env.reset returns the observation and env.step returns four values.

# -*- coding: utf-8 -*-
"""
Created on Sun Dec  8 14:21:25 2019

@author: asus
"""

import gym
import torch
import numpy as np
import torch.nn.functional as F

# Hyper Parameters
GAMMA = 0.95 # discount factor
LEARNING_RATE=0.01

class softmax_network(torch.nn.Module):
    def __init__(self, env):
        super(softmax_network, self).__init__()
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.fc1 = torch.nn.Linear(self.state_dim, 20)
        self.fc1.weight.data.normal_(0, 0.6)
        self.fc2 = torch.nn.Linear(20, self.action_dim)
        self.fc2.weight.data.normal_(0, 0.6)
        
    def create_softmax_network(self, state_input):
        self.h_layer = F.relu(self.fc1(state_input))
        self.softmax_input = self.fc2(self.h_layer)
        all_act_prob = F.softmax(self.softmax_input, dim=1)
        return all_act_prob
    
    def forward(self, state_input, acts, vt):
        self.h_layer = F.relu(self.fc1(state_input))
        self.softmax_input = self.fc2(self.h_layer)
#        print(self.softmax_input)
        neg_log_prob = F.cross_entropy(self.softmax_input, acts, reduction='none')
        
#        print("vt:", vt)
#        print("neg_log_prob:", neg_log_prob)
        loss = (neg_log_prob * vt).sum()
        return loss
        
class Policy_Gradient():
    def __init__(self, env):
        # init some parameters
        self.time_step = 0
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self.model = softmax_network(env)
        self.optimizer = torch.optim.Adam(params=self.model.parameters(), lr=LEARNING_RATE)
        
    def choose_action(self, observation):
        prob_weights = self.model.create_softmax_network(torch.FloatTensor(observation[np.newaxis, :]))
        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.detach().numpy().ravel())  # select action w.r.t the actions prob
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):

        discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
        
        running_add = 0
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * GAMMA + self.ep_rs[t]
            discounted_ep_rs[t] = running_add

        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= np.std(discounted_ep_rs)
#        print(discounted_ep_rs)
        # train on episode
        loss = self.model(torch.FloatTensor(np.vstack(self.ep_obs)), 
                          torch.LongTensor(self.ep_as),
                          torch.FloatTensor(discounted_ep_rs))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []    # empty episode data
# Hyper Parameters
ENV_NAME = 'CartPole-v0'
EPISODE = 3000 # Episode limitation
STEP = 3000 # Step limitation in an episode
TEST = 10 # Number of test episodes run every 100 training episodes

def main():
  # initialize the OpenAI Gym env and the policy gradient agent
  env = gym.make(ENV_NAME)
  agent = Policy_Gradient(env)

  for episode in range(EPISODE):
    # initialize task
    state = env.reset()
    # Train
    for step in range(STEP):
      action = agent.choose_action(state) # sample an action from the current policy
      next_state,reward,done,_ = env.step(action)
      agent.store_transition(state, action, reward)
      state = next_state
      if done:
        #print("stick for ",step, " steps")
        agent.learn()
        break

    # Test every 100 episodes
    if episode % 100 == 0:
      total_reward = 0
      for i in range(TEST):
        state = env.reset()
        for j in range(STEP):
#          env.render()
          action = agent.choose_action(state) # action sampled from the policy for evaluation
          state,reward,done,_ = env.step(action)
          total_reward += reward
          if done:
            break
      ave_reward = total_reward/TEST
      print ('episode: ',episode,'Evaluation Average Reward:',ave_reward)

if __name__ == '__main__':
  main()

Reference: https://www.cnblogs.com/pinard/p/10137696.html
