Reinforcement Learning (RL) 03: Policy-based Reinforcement Learning

天狼啸月1990

Last modified: 2024-01-16 13:10:34

Category: Reinforcement Learning (RL). Tags: reinforcement learning, policy-based RL

First published: 2023-02-27 18:17:26

Article link: https://blog.csdn.net/qq_33419476/article/details/129247155

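Approximate the policy function π with a neural network, called the policy network.

The policy network can be understood in terms of how good the overall situation is: the state-value function Vπ(s) evaluates the overall odds of winning, and its gradient $\frac{\partial V_{\pi}}{\partial \theta}$ is the policy gradient used to update θ.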

1. Policy-based Reinforcement Learning

1.1 Policy Network π(a|s, θ)

1.1.1 review: state-value function approximation

1.2 Main idea of policy learning

1.3 Policy Gradient ascent

1.3.1 policy gradient form 1

1.3.2 policy gradient form 2

1.3.3 summary: policy gradient forms

1.3.4 calculate policy gradient for discrete actions

1.3.5 calculate policy gradient for continuous actions

1.4 Update policy network using policy gradient

1.4.1 The REINFORCE algorithm

1.4.2 approximate Qπ using a neural network

2. Extension: Policy Gradient with Baseline

2.1 Policy Gradient with Baseline

2.1.1 Policy Gradient

2.1.2 Baseline

2.1.3 policy gradient with baseline

2.1.4 Monte Carlo Approximation

2.1.5 Stochastic Policy Gradient

2.1.6 Choices of Baselines

2.2 REINFORCE with Baseline

2.2.1 REINFORCE Algorithm

2.2.2 Policy and Value Networks

2.2.3 Train Policy Network with REINFORCE

2.2.3 Train Value Network with Regression Method

2.2.4 Summary

2.3 Advantage Actor-Critic (A2C)

2.3.1 Actor and Critic

2.3.2 Training of A2C

2.4 REINFORCE vs A2C

2.4.1 Differences

2.4.2 A2C with Multi-step TD Target

2.4.3 REINFORCE with Baseline

2.4.4 A2C versus REINFORCE

3. Extension: Trust Region Policy Optimization (TRPO)

3.1 Optimization Basics

3.1.1 Gradient Ascent

3.1.2 Trust Region

3.1.3 Trust Region Algorithms

3.2 Trust Region Policy Optimization (TRPO)

3.2.1 TRPO Objective Function

3.2.2 TRPO Steps

3.2.3 TRPO Summary

3.2.4 Policy Gradient versus TRPO

References

1. Policy-based Reinforcement Learning

1.1 Policy Network π(a|s, θ)

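Once we have a policy function π, we can use it to control the agent's actions.

The policy function π(a|s) is a probability density function (PDF) over actions: for a given state s, the probabilities of all actions sum to 1.

Input: the current state s. Output: the probabilities of all the actions.

How do we obtain such a policy function π?

We approximate the policy function π with a neural network, called the policy network π(a|s; θ).

θ denotes the parameters of the neural network; policy gradient learning is used to improve θ.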
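As a minimal sketch of what such a network computes, the snippet below uses a single linear layer followed by a softmax. The layer sizes, the random initialization, and the names `policy_network` and `softmax` are assumptions made for illustration, not part of the original post.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def policy_network(s, theta):
    """Linear-softmax policy (an illustrative assumption): returns pi(.|s; theta)."""
    W, b = theta
    return softmax(W @ s + b)

rng = np.random.default_rng(0)
state_dim, num_actions = 4, 3  # hypothetical sizes
theta = (rng.normal(size=(num_actions, state_dim)), np.zeros(num_actions))

s = rng.normal(size=state_dim)          # a made-up state vector
probs = policy_network(s, theta)        # probabilities of all actions
a = rng.choice(num_actions, p=probs)    # sample an action from pi(.|s; theta)
```

The output vector is what the text calls the probabilities of all the actions; the agent acts by sampling from it rather than always taking the most likely action.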

1.1.1 review: state-value function approximation

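Ut, the discounted return, is the weighted sum of all future rewards from time t onward: $U_{t} = R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \cdots$

Future rewards cannot be observed at time t, so they are modeled as random variables; each reward Ri depends on the state Si and the action Ai.

The randomness of actions comes from the policy function π; the randomness of states comes from the state-transition function p.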
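Once an episode has been played and the rewards are observed, the return for every time step can be computed with a single backward pass. The sketch below assumes a finite episode; the function name `discounted_returns` and the reward values are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [u_0, u_1, ...] where u_t = sum_{i >= t} gamma^(i - t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it: u_t = r_t + gamma * u_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```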

Action-value function

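Qπ(st, at) is the conditional expectation of Ut given the state st and the action at. It integrates out all actions Ai and states Si after time t, and evaluates how good it is to take action at in state st.

State-Value Function

Vπ(st) is the expectation of Qπ(st, A) over the action A. With A integrated out, it evaluates how good the current state and the policy are, i.e. how likely the agent is to win.

Approximating the state-value function with a neural network: V(st; θ)

approximate state-value function Vπ(st): $V(s_{t}; \theta) = \sum_{a} \pi(a|s_{t}; \theta) \cdot Q_{\pi}(s_{t}, a)$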

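1.2 Main idea of policy learning

Approximate the policy function π with a policy network; the state-value function can then be written as $V(s; \theta)$ , which evaluates how good the policy function and the state s are.

For a given state s, the better the policy function, the larger Vπ. How can we make the policy function better and better?

By improving the model parameters θ so that V(s; θ) becomes larger.

Define the objective function as the expectation of V(s; θ): $J(\theta) = E_{S}[V(S; \theta)]$ . The expectation is taken over the random state S, so once S is averaged out only θ remains. J(θ) is therefore an evaluation of the policy network: the better the policy network, the larger J(θ).

The goal of policy-based learning is to improve θ so that J(θ) is as large as possible.

How do we improve θ? With the policy gradient ascent algorithm.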

1.3 Policy Gradient ascent

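Observe state s. As the agent plays, it observes a different s at every step, which can be viewed as a random sample from the probability distribution of states.
Update policy by: $\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$ . This is stochastic gradient ascent (ascent, because we want the expected return to keep growing), and $\frac{\partial V(s; \theta)}{\partial \theta}$ is called the policy gradient.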

1.3.1 policy gradient form 1

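The derivative of a sum equals the sum of the derivatives, so the derivative with respect to θ can be moved inside the sum over actions.

In practice, this formula is usually not used directly to compute the policy gradient.

What is used in practice is a Monte Carlo approximation of the policy gradient.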

1.3.2 policy gradient form 2

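Summing a term weighted by π(a|s; θ) over all actions a is the same as taking the expectation of that term with respect to A, where A is drawn from π(·|s; θ). Here the term is $\frac{\partial ln \pi(a|s;\theta)}{\partial \theta} \cdot Q_{\pi}(s, a)$ .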

1.3.3 summary: policy gradient forms
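The two forms are reconstructed here from the definitions above. This is a sketch that, as is common in this simplified derivation, treats $Q_{\pi}$ as if it did not depend on θ:

Form 1: $\frac{\partial V(s;\theta)}{\partial \theta} = \sum_{a} \frac{\partial \pi(a|s;\theta)}{\partial \theta} \cdot Q_{\pi}(s, a)$

Form 2: $\frac{\partial V(s;\theta)}{\partial \theta} = E_{A \sim \pi(\cdot|s;\theta)} \left[ \frac{\partial ln \pi(A|s;\theta)}{\partial \theta} \cdot Q_{\pi}(s, A) \right]$

The step that links them is the chain-rule identity $\frac{\partial \pi(a|s;\theta)}{\partial \theta} = \pi(a|s;\theta) \cdot \frac{\partial ln \pi(a|s;\theta)}{\partial \theta}$ : substituting it into Form 1 produces a sum weighted by π, which is an expectation over A.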

1.3.4 calculate policy gradient for discrete actions

For discrete actions, the first form of the policy gradient can be used: sum the per-action terms over all actions.
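To make the discrete case concrete, the sketch below evaluates both forms exactly for a softmax policy over three actions and checks that they agree. The logits `theta` and the action values `q` are made-up numbers, and parameterizing the policy directly by its logits is an assumption for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

theta = np.array([0.2, -0.1, 0.5])  # hypothetical logits, one per action
q = np.array([1.0, 3.0, -2.0])      # hypothetical Q_pi(s, a) values
pi = softmax(theta)

# Softmax Jacobian: d pi[a] / d theta[k] = pi[a] * (1[a == k] - pi[k]).
dpi = np.diag(pi) - np.outer(pi, pi)

# Form 1: sum over actions of (d pi(a|s) / d theta) * Q(s, a).
grad_form1 = dpi.T @ q

# Form 2: E_A[(d ln pi(A|s) / d theta) * Q(s, A)], computed exactly.
dlogpi = np.eye(3) - pi  # row a holds d ln pi[a] / d theta = onehot(a) - pi
grad_form2 = (pi[:, None] * dlogpi * q[:, None]).sum(axis=0)

print(np.allclose(grad_form1, grad_form2))  # True
```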

1.3.5 calculate policy gradient for continuous actions

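For continuous actions, use the second form of the policy gradient.

Because A is a continuous variable, computing the expectation requires an integral. That integral cannot be computed in closed form, because π is a very complicated neural network.

Instead, use a Monte Carlo approximation of the expectation.

Sample a concrete action $\hat{a}$ from the policy function π(·|s; θ) and compute $g(\hat{a}, \theta) = \frac{\partial ln \pi(\hat{a}|s;\theta)}{\partial \theta} \cdot Q_{\pi}(s, \hat{a})$ , which serves as the Monte Carlo approximation of the policy gradient.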
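A minimal sketch of this one-sample estimate follows, assuming a Gaussian policy whose mean is linear in the state and whose standard deviation is fixed. The function `q_value` is a hypothetical stand-in, since the true Qπ is unknown in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                     # fixed standard deviation (assumption)
theta = np.array([0.3, -0.2])   # hypothetical policy parameters
s = np.array([1.0, 2.0])        # hypothetical state

def q_value(s, a):
    # Hypothetical stand-in for Q_pi(s, a); not something the agent really knows.
    return -(a - 1.0) ** 2

mu = theta @ s                  # policy mean: pi(.|s; theta) = Normal(mu, sigma^2)
a_hat = rng.normal(mu, sigma)   # sample one concrete action

# For this Gaussian policy, d ln pi(a|s; theta) / d theta = (a - mu) / sigma^2 * s.
dlogpi = (a_hat - mu) / sigma**2 * s
g = dlogpi * q_value(s, a_hat)  # one-sample Monte Carlo policy gradient
```

A single sample is noisy but unbiased, which is why it can stand in for the expectation inside stochastic gradient ascent.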

1.4 Update policy network using policy gradient

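To update θ, we do gradient ascent driven by how good the chosen action is. Vπ is the expectation of Qπ over actions, so the derivative of V with respect to θ is the policy gradient, and g(at, θt) is a Monte Carlo estimate of it. The update is then $\theta_{t+1} = \theta_{t} + \beta \cdot g(a_{t}, \theta_{t})$ .

But the action-value function Qπ is unknown, so how do we approximate qt ≈ Qπ(st, at)?

One option is the REINFORCE algorithm.
The other is to approximate Qπ using a neural network.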

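1.4.1 The REINFORCE algorithm

Use the policy network π to control the agent.

This method has to play an episode to the end before qt can be estimated, because it uses the observed return ut as the estimate of Qπ(st, at).

The alternative, which does not require finishing an episode, is to approximate the Qπ function with a neural network.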
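The REINFORCE update can be sketched as follows for a small tabular softmax policy. The table of logits, the recorded episode, and the name `reinforce_update` are assumptions for illustration; all gradients here are evaluated at the old θ and applied together at the end of the episode.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma, beta):
    """theta[s] holds the logits of pi(.|s); returns the updated table."""
    grad = np.zeros_like(theta)
    u = 0.0
    # Walk backwards through the finished episode to accumulate the return u_t.
    for t in reversed(range(len(rewards))):
        u = rewards[t] + gamma * u
        s, a = states[t], actions[t]
        dlogpi = -softmax(theta[s])  # d ln pi(a|s) / d logits = onehot(a) - pi
        dlogpi[a] += 1.0
        grad[s] += u * dlogpi        # u_t plays the role of q_t
    return theta + beta * grad       # gradient ascent step

theta0 = np.zeros((2, 2))            # 2 states, 2 actions (hypothetical)
theta1 = reinforce_update(theta0, states=[0, 1], actions=[1, 0],
                          rewards=[0.0, 1.0], gamma=1.0, beta=0.1)
```

After the update, the actions that were taken in this rewarded episode have larger logits than the alternatives in their states.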

1.4.2 approximate Qπ using a neural network

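Above we approximated the policy function π with one neural network; here we use a second neural network to approximate Qπ.

This gives the actor-critic method, which uses two neural networks.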

2. Extension: Policy Gradient with Baseline

Adding a baseline to the policy gradient reduces variance and makes convergence faster.

2.1 Policy Gradient with Baseline

2.1.1 Policy Gradient

2.1.2 Baseline

2.1.3 policy gradient with baseline

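The baseline b does not affect correctness as long as it does not depend on the action A, so why do we add b to the formula?

Because the algorithm uses a Monte Carlo approximation of the formula above: b does not change the expectation, but it does change the Monte Carlo approximation.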

2.1.4 Monte Carlo Approximation

When training the policy network in practice, what is actually used is the stochastic policy gradient g(at).

2.1.5 Stochastic Policy Gradient

2.1.6 Choices of Baselines

Choice 1: b=0

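Choice 2: b is the state-value, b = Vπ(st)

Why use such a baseline? If the baseline b is close to Qπ, the Monte Carlo approximation of this expectation has lower variance, which makes the algorithm converge faster.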
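The effect of the two choices can be checked exactly on a toy example. The sketch below assumes a uniform softmax policy over three actions and made-up action values; it computes the mean and the total variance of the stochastic gradient for b = 0 and for b = Vπ.

```python
import numpy as np

pi = np.full(3, 1.0 / 3.0)         # uniform policy over 3 actions (assumption)
q = np.array([10.0, 11.0, 12.0])   # hypothetical Q_pi(s, a) values
v = pi @ q                         # state value V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
dlogpi = np.eye(3) - pi            # row a: d ln pi(a|s) / d logits

def mean_and_variance(b):
    """Exact mean and total variance of g(A) = dlogpi(A) * (Q(A) - b) under A ~ pi."""
    g = dlogpi * (q - b)[:, None]            # row a is g(a)
    mean = pi @ g
    second_moment = pi @ (g ** 2).sum(axis=1)
    return mean, second_moment - mean @ mean

mean_0, var_0 = mean_and_variance(0.0)  # Choice 1: b = 0
mean_v, var_v = mean_and_variance(v)    # Choice 2: b = V_pi(s)
```

Both baselines give the same expected gradient, while the variance with b = Vπ is far smaller in this example because the large common offset in Q is subtracted out.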

2.2 REINFORCE with Baseline

2.2.1 REINFORCE Algorithm

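Use the state-value function Vπ as the baseline in the policy gradient.

Approximate Qπ with the observed return ut, i.e. approximate an expectation with an observation. This is also a Monte Carlo approximation, and the resulting algorithm is called the REINFORCE algorithm.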

Observing the trajectory: st, at, rt, st+1, at+1, rt+1,..., sn, an, rn.
Compute return: $u_{t} = \sum_{i=t}^{n} \gamma^{i-t} \cdot r_{i}$
ut is an unbiased estimate of Qπ(st, at)
Approximate Vπ(st) by the value network, v(st; w)

Three approximations:

Approximate expectation using one sample, g(at). (Monte Carlo.)
Approximate Qπ(st, at) by ut. (Another Monte Carlo.)
Approximate Vπ(s) by the value network, v(s;w)

2.2.2 Policy and Value Networks

Policy Network

Goal: approximate the policy function, π(a|s), by the policy network, π(a|s; θ)

Value Network

Approximate state-value, Vπ(s), by value network, v(s; w)

Parameter Sharing

2.2.3 Train Policy Network with REINFORCE

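Update the policy network by $\theta \leftarrow \theta - \beta \cdot \delta_{t} \cdot \frac{\partial ln \pi(a_{t}|s_{t};\theta)}{\partial \theta}$ , where $\delta_{t} = v(s_{t}; w) - u_{t}$ is the prediction error. With this sign convention, $-\delta_{t} = u_{t} - v(s_{t}; w)$ approximates $Q_{\pi}(s_{t},a_{t}) - V_{\pi}(s_{t})$ , so the minus sign still gives a gradient ascent step.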

2.2.3 Train Value Network with Regression Method

2.2.4 Summary
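As a summary sketch, one update of REINFORCE with baseline at time t might look like the code below. It assumes a linear-softmax policy, a linear value function v(s; w) = w · s, the convention δt = v(st; w) − ut, and a squared-error regression loss for the value network; the function name and step sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def reinforce_baseline_step(theta, w, s, a, u, beta, alpha):
    """theta: (num_actions, state_dim) policy weights; w: (state_dim,) value weights."""
    pi = softmax(theta @ s)
    delta = w @ s - u                     # delta_t = v(s_t; w) - u_t
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    dlogpi = np.outer(onehot - pi, s)     # d ln pi(a|s; theta) / d theta
    theta_new = theta - beta * delta * dlogpi  # policy network: gradient ascent
    w_new = w - alpha * delta * s              # value network: descent on 0.5 * delta^2
    return theta_new, w_new

theta_new, w_new = reinforce_baseline_step(
    theta=np.zeros((2, 2)), w=np.zeros(2), s=np.array([1.0, 0.0]),
    a=0, u=1.0, beta=0.1, alpha=0.1)
```

In this example the observed return exceeds the baseline prediction, so the probability of the chosen action goes up and the value estimate for the state moves toward the return.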

2.3 Advantage Actor-Critic (A2C)

A2C = Actor-Critic with Baseline

2.3.1 Actor and Critic

2.3.2 Training of A2C

The mathematical derivation of the formulas above follows:

Properties of Value Functions

Monte Carlo Approximations

The A2C method will use these two expectations, but it's too hard to calculate them, so we use Monte Carlo Method to approximate them.

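The key to Advantage Actor-Critic is the Monte Carlo approximation of Qπ(st, at): $Q_{\pi}(s_{t},a_{t}) \approx r_{t} + \gamma \cdot V_{\pi}(s_{t+1})$ .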

Updating Policy Network

Policy Gradient with Baseline: $g(a_{t}) = \frac{\partial ln \pi (a_{t}|s_{t};\theta)}{\partial \theta} \cdot (Q_{\pi}(s_{t},a_{t})-V_{\pi}(s_{t}))$

Advantage function: $Q_{\pi}(s_{t},a_{t}) - V_{\pi}(s_{t})$ .

Qπ and Vπ in this formula are unknown, so the stochastic gradient cannot be computed directly. We therefore approximate them: the action-value by the Monte Carlo approximation, and the state-value Vπ(s) by the value network v(s; w).

Approximate stochastic policy gradient: $g(a_{t}) \approx \frac{\partial ln \pi (a_{t}|s_{t};\theta)}{\partial \theta} \cdot (r_{t} + \gamma \cdot v(s_{t+1};w) - v(s_{t};w))$

Updating Value Network

Derive TD Target

MC approximation: $V_{\pi}(s_{t}) \approx r_{t} + \gamma \cdot V_{\pi}(s_{t+1})$

Replace Vπ by the value network on both sides: $v(s_{t};w) \approx r_{t} + \gamma \cdot v(s_{t+1};w)$ . The right-hand side is the TD target.

Summary: Approximate Policy Gradient

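The factor $r_{t} + \gamma \cdot v(s_{t+1};w) - v(s_{t};w)$ is the judgment made by the value network. It evaluates how good action at is and guides the policy network's improvement, which is why the value network is called the critic.

$v(s_{t};w)$ approximates E[Ut|st]. It is the output of the value network, a prediction made at time t: based on the state st, it predicts the return Ut, which is unknown until the episode ends. It evaluates how good state st is.

$r_{t} + \gamma \cdot v(s_{t+1};w)$ approximates E[Ut|st, st+1]. It is the value network's prediction of the return Ut made at time t+1, based on the two states st and st+1. It is also an evaluation of st.

Both terms approximate an expectation of the return Ut, and both evaluate how good the state st is at time t.

$v(s_{t};w)$ is a prediction made at time t, before action at is executed, so it is independent of at.

$r_{t} + \gamma \cdot v(s_{t+1};w)$ is a prediction made at time t+1. It is affected by action at, so it depends on at.

If at is good, their difference is positive. The difference therefore reflects the advantage brought by action at; it is called the advantage, i.e. the evaluation made by the critic.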
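One A2C update built from these quantities can be sketched as below. It assumes a linear-softmax actor and a linear critic v(s; w) = w · s; the function name `a2c_step`, the step sizes, and the example transition are hypothetical, and terminal states are not handled.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def a2c_step(theta, w, s, a, r, s_next, gamma, beta, alpha):
    """One actor and critic update from a single transition (s, a, r, s_next)."""
    td_target = r + gamma * (w @ s_next)  # prediction made at time t+1
    advantage = td_target - w @ s         # minus the prediction made at time t
    pi = softmax(theta @ s)
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    dlogpi = np.outer(onehot - pi, s)     # d ln pi(a|s; theta) / d theta
    theta_new = theta + beta * advantage * dlogpi  # actor: gradient ascent
    w_new = w + alpha * advantage * s              # critic: move v(s; w) toward the TD target
    return theta_new, w_new

theta_new, w_new = a2c_step(
    theta=np.zeros((2, 2)), w=np.array([0.5, 1.0]),
    s=np.array([1.0, 0.0]), a=1, r=1.0, s_next=np.array([0.0, 1.0]),
    gamma=0.9, beta=0.1, alpha=0.1)
```

Unlike REINFORCE, this update needs only one transition, so it can be applied before the episode ends.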

2.4 REINFORCE vs A2C

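2.4.1 Differences

Both need a policy network and a value network, and the network architectures they use are exactly the same.

Although the networks are the same, the value network plays a different role in each.

In A2C the value network is called the critic and evaluates the actor's performance. In REINFORCE the value network is only a baseline: it does not evaluate how good an action is, and its sole purpose is to reduce the variance of the stochastic gradient.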

2.4.2 A2C with Multi-step TD Target

In practice, a multi-step TD target works better than a one-step TD target.

2.4.3 REINFORCE with Baseline

ut is actually observed, which makes it different from the TD target!

2.4.4 A2C versus REINFORCE

REINFORCE with baseline is in fact a special case of A2C!

If all the rewards r are used to compute the multi-step TD target, i.e. no bootstrapping is done, A2C becomes REINFORCE.
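This relationship is easy to check numerically. In the sketch below, the function name `multi_step_td_target` and the numbers are made up; `v_bootstrap` stands for the value network's prediction at the state reached after the listed rewards.

```python
def multi_step_td_target(rewards, v_bootstrap, gamma):
    """sum_i gamma^i * rewards[i] + gamma^m * v_bootstrap, where m = len(rewards)."""
    target = v_bootstrap
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# One-step TD target: a single reward, then bootstrap from the value network.
one_step = multi_step_td_target([1.0], v_bootstrap=4.0, gamma=0.5)

# All rewards until the end of the episode and no bootstrapping (v_bootstrap = 0):
# the target is exactly the observed return u_t used by REINFORCE.
full_return = multi_step_td_target([1.0, 0.0, 2.0], v_bootstrap=0.0, gamma=0.5)

print(one_step, full_return)  # 3.0 1.5
```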

3. Extension: Trust Region Policy Optimization (TRPO)

TRPO (Trust Region Policy Optimization) is more computationally expensive than policy gradient, but it behaves more stably and converges faster.

3.1 Optimization Basics

3.1.1 Gradient Ascent

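Numerical optimization: optimization model, objective function, constraints.

Maximization problem: Find $\theta^{*} = \underset{\theta}{argmax} J(\theta)$

J(θ) is the objective function.
θ is the optimization variable; the optimal solution is denoted θ*.

We want a numerical algorithm to find θ*, and gradient ascent is the simplest one.

Gradient ascent repeats the following two steps until the 2-norm of the gradient is close to zero: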
- at θold, compute gradient $g = \frac{\partial J(\theta)}{\partial \theta}|_{\theta = \theta_{old}}$ .
- Gradient ascent: $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$ .
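These two steps can be sketched as a short loop. The objective used here, J(θ) = −‖θ − c‖², is a hypothetical concave example chosen so that the maximizer is known; `gradient_ascent`, the step size, and the tolerance are illustrative assumptions.

```python
import numpy as np

def gradient_ascent(grad_fn, theta, alpha=0.1, tol=1e-6, max_iters=10_000):
    for _ in range(max_iters):
        g = grad_fn(theta)             # gradient evaluated at theta_old
        if np.linalg.norm(g) < tol:    # stop once the gradient norm is near zero
            break
        theta = theta + alpha * g      # theta_new <- theta_old + alpha * g
    return theta

target = np.array([1.0, -2.0])         # maximizer of the toy objective
theta_star = gradient_ascent(lambda th: -2.0 * (th - target), np.zeros(2))
```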

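Gradient ascent requires knowing the gradient of the objective function J(θ) with respect to θ, but in some cases this gradient cannot be computed.

e.g. Assume $J(\theta) = E_{S}[V(S; \theta)]$ ; the expectation over S generally has no closed form.

Stochastic gradient ascent is a Monte Carlo approximation of the expectation. It repeats:

- s ← random sample of the state.
- at θold, compute gradient $g = \frac{\partial V(s;\theta)}{\partial \theta}|_{\theta = \theta_{old}}$ .
- Gradient ascent: $\theta_{new} \leftarrow \theta_{old} + \alpha \cdot g$ .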
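A toy version of stochastic gradient ascent is sketched below. It assumes a hypothetical V(s; θ) = −(θ − s)² with S drawn from a normal distribution with mean 2, so that J(θ) is maximized at θ = 2; the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_v(s, theta):
    # Hypothetical V(s; theta) = -(theta - s)^2, so dV/dtheta = -2 * (theta - s).
    return -2.0 * (theta - s)

theta, alpha = 0.0, 0.001
for _ in range(20_000):
    s = rng.normal(loc=2.0, scale=1.0)        # s <- random sample
    theta = theta + alpha * grad_v(s, theta)  # ascent step using the sampled gradient
```

Each step uses the gradient at a single sampled state instead of the full expectation, and θ still drifts toward the maximizer of J(θ) and then fluctuates around it.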

3.1.2 Trust Region

Maximization problem: Find $\theta^{*} = \underset{\theta}{argmax} J(\theta)$

Let $\mathbb{N} (\theta_{old})$ be a neighborhood of $\theta_{old}$ with radius Δ, for example $\mathbb{N} (\theta_{old}) = \{ \theta : \| \theta - \theta_{old} \|_{2} \leq \Delta \}$ .
If we have a function, $L(\theta | \theta_{old})$ , that well approximates J(θ) in $\mathbb{N} (\theta_{old})$ , then $\mathbb{N} (\theta_{old})$ is called "trust region".

3.1.3 Trust Region Algorithms

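Idea: within the neighborhood $\mathbb{N} (\theta_{old})$ , the constructed function $L(\theta | \theta_{old})$ is very close to the objective function J(θ), so L can be used in place of J, and we search for the maximum of L inside $\mathbb{N} (\theta_{old})$ . Because L and J are close there, the point that maximizes L also increases J.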
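A minimal sketch of this approximate-then-maximize loop is given below. It assumes the simplest possible L, a first-order Taylor expansion of J around θold, whose maximizer over a ball of radius Δ lies on the boundary in the gradient direction. This choice of L, the toy objective, and the name `trust_region_ascent` are assumptions for illustration; TRPO itself uses a different L.

```python
import numpy as np

def trust_region_ascent(grad_fn, theta, radius=0.1, num_iters=100):
    for _ in range(num_iters):
        g = grad_fn(theta)
        norm = np.linalg.norm(g)
        if norm < 1e-12:               # already at a stationary point
            break
        # Approximation: L(theta | theta_old) = J(theta_old) + g . (theta - theta_old).
        # Maximization: over the ball ||theta - theta_old|| <= radius, L is largest
        # at the boundary point in the direction of g.
        theta = theta + radius * g / norm
    return theta

target = np.array([1.0, -2.0])          # maximizer of the toy J(theta) = -||theta - target||^2
theta_tr = trust_region_ascent(lambda th: -2.0 * (th - target), np.zeros(2))
```

Every step stays inside the region where the approximation is trusted, so no single update can move θ farther than the radius. With a fixed radius the iterate ends up within one radius of the maximizer.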