李宏毅 (Hung-yi Lee) Deep Reinforcement Learning Notes (Part 1)
Policy Gradient
Policy gradient goes from on-policy to off-policy; add a few constraints on top of that and you get PPO.
Review of policy gradient:
Basic components: Actor, Environment, Reward (the latter two are not under our control; the only thing we can change is the Actor).
The policy $\pi$ is a network with parameters $\theta$. Its input is the observed state of the environment, represented as a vector or matrix, and its output is the probability of every action.
The total reward of an episode:
$$R=\sum_{t=1}^{T}r_t$$
Trajectory:
$$\tau =\left\{s_1,a_1,s_2,a_2,\dots,s_T,a_T \right\}$$
$$p_\theta(\tau)=p(s_1)\,p_\theta(a_1|s_1)\,p(s_2|s_1,a_1)\,p_\theta(a_2|s_2)\cdots=p(s_1) \prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
$p(s_{t+1}|s_t,a_t)$ is determined by the environment and is not under our control; what we can control is the actor's $p_\theta(a_t|s_t)$.
Summing over all possible trajectories gives the expected reward:
$$\bar{R}_\theta=\sum_{\tau}R(\tau)p_\theta(\tau)=E_{\tau \sim p_\theta(\tau)}[R(\tau)]$$
The trajectory $\tau$ follows the distribution $p_\theta(\tau)$, and $R(\tau)$ is the total reward of trajectory $\tau$; the average of $R(\tau)$ under the distribution $p_\theta(\tau)$ is written as $E_{\tau \sim p_\theta(\tau)}[R(\tau)]$.
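The sum over all trajectories cannot be computed exactly, so in practice the expectation is estimated by sampling. Below is a minimal sketch of such a Monte Carlo estimate on CartPole-v1, using a uniformly random policy purely for illustration (the environment, the episode count `N`, and the random policy are my assumptions, and the same pre-0.26 gym API as the code later in these notes is assumed):

```python
import gym
import numpy as np

# Estimate E[R(tau)] ~= (1/N) * sum_n R(tau^n) by sampling N episodes.
env = gym.make('CartPole-v1')
N = 100
returns = []
for _ in range(N):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()       # random "policy" pi(a|s), illustration only
        obs, reward, done, _ = env.step(action)  # environment transition p(s_{t+1}|s_t,a_t)
        total_reward += reward                   # accumulate R(tau)
    returns.append(total_reward)
print('Monte Carlo estimate of the expected reward:', np.mean(returns))
```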
The goal is to maximize the expected reward. Ordinary gradient descent finds a minimum; to find a maximum, simply flip the minus sign in the parameter update into a plus sign (i.e. gradient ascent):
$$\theta=\theta-\alpha \nabla_\theta \bar{R}_\theta \;\rightarrow\; \theta=\theta+\alpha \nabla_\theta \bar{R}_\theta$$
Take the gradient of $\bar{R}_\theta$:
$$\nabla_\theta \bar{R}_\theta = \sum_{\tau}R(\tau)\nabla_\theta p_\theta(\tau) =\sum_{\tau}R(\tau)p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$$
Using the derivative formula for the log, $\nabla_\theta \log p_\theta(\tau)=\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$, this can be rewritten, and the expectation can then be approximated by sampling $N$ trajectories:
$$\begin{aligned} \nabla_\theta \bar{R}_\theta &=\sum_{\tau}R(\tau)p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}=\sum_{\tau}R(\tau)p_\theta(\tau)\nabla_\theta \log p_\theta (\tau)\\ &=E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla_\theta \log p_\theta (\tau)]\\ &\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\nabla_\theta \log p_\theta (\tau^n) \end{aligned}$$
From the derivation above we need $\nabla_\theta \log p_\theta(\tau)$, where $p_\theta(\tau)$ is given below; the factors that do not depend on $\theta$ have zero gradient and simply drop out.
$$p_\theta(\tau)=p(s_1) \prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
$$\log p_\theta(\tau)=\log p(s_1)+\sum_{t=1}^{T}\log p_\theta(a_t|s_t)+\sum_{t=1}^{T}\log p(s_{t+1}|s_t,a_t)$$
Only the middle sum depends on $\theta$; $\log p(s_1)$ and the transition terms $\log p(s_{t+1}|s_t,a_t)$ belong to the environment.
Therefore:
$$\begin{aligned} \nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\nabla_\theta \log p_\theta (\tau^n) &= \frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\sum_{t=1}^{T_n}\nabla_\theta \log p_\theta(a_t^n|s_t^n)\\ &=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\nabla_\theta \log p_\theta(a_t^n|s_t^n) \end{aligned}$$
So the parameter update rule is:
$$\theta\leftarrow \theta+\alpha \nabla_\theta \bar{R}_\theta$$
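In an autodiff framework this gradient is usually realized by building a surrogate loss whose gradient equals $-\nabla_\theta \bar{R}_\theta$ and then doing ordinary gradient descent on it. Below is a minimal PyTorch sketch; the function name and the assumption that `log_probs` holds $\log p_\theta(a_t|s_t)$ for one sampled trajectory are mine, for illustration only:

```python
import torch

# Suppose, for a single sampled trajectory (N = 1), we have collected:
#   log_probs : list of scalar tensors log pi_theta(a_t|s_t), still attached to the graph
#   R         : float, total reward R(tau) of that trajectory
# Then loss = -R * sum_t log pi_theta(a_t|s_t), so a gradient-descent step on `loss`
# is a gradient-ascent step on the expected reward.
def policy_gradient_loss(log_probs, R):
    return -R * torch.stack(log_probs).sum()

# usage sketch (an optimizer over the policy's parameters is assumed to exist):
# loss = policy_gradient_loss(log_probs, R)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The code at the end of these notes builds essentially this loss, except that it weights each step with a per-step discounted return instead of the whole-trajectory reward.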
Improvements:
- Add a Baseline:
  $R(\tau^n)$ is the total reward of the $n$-th trajectory and acts as a weight, so trajectories with a small $R(\tau^n)$ should have their probability decreased. However, $R(\tau^n)$ may always be positive (there are no negative rewards), so a baseline $b$ is subtracted:
  $$\nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big(R(\tau^n)-b\big)\nabla_\theta \log p_\theta(a_t^n|s_t^n)$$
  $b$ can be taken as $E[R(\tau)]$, or chosen in other ways. The point is that the weight of a trajectory can now be positive or negative: whenever $(R(\tau^n)-b)$ is negative, the probability of that trajectory is decreased.
- Assign Suitable Credit:
  Within one episode, every $\nabla_\theta \log p_\theta(a_t^n|s_t^n)$ is multiplied by the same factor $(R(\tau^n)-b)$, which is unfair to the individual actions $a_t$. A reward that better represents the contribution of $a_t$ is
  $$\sum_{t'=t}^{T_n}r_{t'}^n,$$
  the sum of all rewards from the time of $a_t$ until the end of the episode. Going one step further, multiply each reward by a discount factor $\gamma<1$, so that rewards further in the future are credited less to $a_t$:
  $$\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n$$
  The gradient expression is then updated to (a code sketch follows this list):
  $$\nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n-b\Big)\nabla_\theta \log p_\theta(a_t^n|s_t^n)$$
  The baseline $b$ can be chosen in different ways and will be discussed later. The term
  $$\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n-b$$
  can be called the advantage function.
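Below is a minimal sketch of this discounted reward-to-go weight, computed for every time step of one episode. The function name `discounted_rewards_to_go` and the choice of the episode's mean return as the baseline $b$ are illustrative assumptions, not something fixed by the notes:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma, baseline=0.0):
    """For each step t, return sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} - baseline."""
    G = np.zeros(len(rewards))
    running = 0.0
    # accumulate from the back: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G - baseline

# usage sketch: rewards collected from one episode
rewards = [1.0, 1.0, 1.0, 0.0]
weights = discounted_rewards_to_go(rewards, gamma=0.99, baseline=np.mean(rewards))
print(weights)
```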
Below is the code for episode-based (Monte Carlo) policy gradient. Because the parameters are updated at the end of every episode, the sum over $N$ in the gradient formula above is dropped (effectively $N=1$), and the code below does not add a baseline:
import gym
import numpy as np
import matplotlib.pylab as plt
import torch.nn as nn
import torch.nn.functional as F
import torch
from torch.distributions import Categorical
class Policy(nn.Module):
    def __init__(self, s_size=4, h_size=128, a_size=2):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(s_size, h_size)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(h_size, a_size)
        self.saved_log_prob = []
        self.rewards = []

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)  # convert the scores into action probabilities


class Agent():
    def __init__(self, episode, max_steps, gamma):
        '''
        :param episode: number of episodes
        :param max_steps: maximum number of steps per episode
        :param gamma: discount factor for future rewards
        '''
        self.episode = episode
        self.max_steps = max_steps
        self.gamma = gamma

    def decide(self, observation):
        state = torch.from_numpy(observation).float().unsqueeze(0)  # add a dimension: a batch_size of 1 for the network
        pro_action = policy(state)
        m = Categorical(pro_action)  # categorical distribution over action indices with probabilities pro_action
        action = m.sample()  # sample the action that will actually be executed
        return action.item(), m.log_prob(action)  # return the action and the log-probability of that action

    def learn(self):
        collect_loss = []
        collect_reward = []
        for i in range(1, self.episode + 1):
            observation = env.reset()
            env.render()
            G = []
            log_pro_actions = []
            # collect one episode of data
            for j in range(self.max_steps):
                action, log_pro_action = self.decide(observation)  # action is 0 or 1
                next_observation, reward, done, _ = env.step(action)
                G.append(reward)  # collect the reward
                log_pro_actions.append(log_pro_action)  # collect the log-probability of the action
                if done:
                    break
                observation = next_observation
            collect_reward.append(np.sum(G))  # total reward of this episode
            # compute the discounted return G_t of every action, from the last step backwards:
            # G_t = r_t + gamma * G_{t+1}
            for k in range(len(G) - 2, -1, -1):
                G[k] = G[k] + self.gamma * G[k + 1]
            # the loss of each action is its log-probability times G_t; summing gives the total loss.
            # The minus sign turns maximizing the expected return into minimizing a loss,
            # so ordinary gradient descent can be used.
            loss = [-pro * r for pro, r in zip(log_pro_actions, G)]
            loss = torch.cat(loss).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            collect_loss.append(loss.item())  # store a plain float so the computation graph can be freed
            if i % 100 == 0:
                print('iteration {:d}: reward {:.4f}'.format(i, np.mean(collect_reward[-100:])))
        return collect_reward, collect_loss
env=gym.make('CartPole-v1')
env.seed(0)
# print(env.observation_space)  # the state space has 4 dimensions
# print(env.action_space)  # the action space has 2 actions
policy=Policy()
optimizer=torch.optim.Adam(policy.parameters(),lr=1e-2)
agent=Agent(episode=1000,max_steps=100,gamma=0.5)
collect_reward,collect_loss=agent.learn()
env.close()
plt.figure()
plt.plot(collect_loss)
plt.title('loss')
plt.figure()
plt.plot(collect_reward)
plt.title('reward')
plt.show()
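As mentioned above, this code does not use a baseline. One minimal way to add one, assuming we simply take $b$ to be the mean return of the current episode (an illustrative choice, not the only one), would be to change the loss construction inside `Agent.learn` roughly as follows:

```python
# hypothetical change inside Agent.learn, after the discounted returns G have been computed:
b = np.mean(G)  # baseline b: mean return of this episode (illustrative choice)
loss = [-pro * (r - b) for pro, r in zip(log_pro_actions, G)]
loss = torch.cat(loss).sum()
```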
I am still a beginner; if anything here is wrong, corrections are sincerely appreciated.