Value Function Approximation
Introduction
Why do we need it?
- So far we have represented the value function by a lookup table
  - Every state $s$ has an entry $V(s)$
  - Or every state-action pair $(s,a)$ has an entry $Q(s,a)$
- Problem with large MDPs:
  - There are too many states and/or actions to store in memory
  - It is too slow to learn the value of each state individually
Solution for large MDPs:
- Estimate value function with function approximation
$$\hat{v}(s,\mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)$$
Types of Value Function Approximation
Which Function Approximator?
There are many function approximators, e.g.:

- Linear combinations of features
- Neural networks
- Decision trees
- Nearest neighbour
- Fourier / wavelet bases
- …

We consider differentiable function approximators, in particular linear combinations of features and neural networks.
Incremental Methods
Value Function Approx. by SGD
Goal: find the parameter vector $\mathbf{w}$ minimising the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:

$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] \tag{1}$$
Gradient descent finds a local minimum:

$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \right]$$

The expected update is equal to the full gradient update.
Stochastic gradient descent samples the gradient:

$$\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \tag{2}$$
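As a concrete illustration of Eq. (2), here is a minimal Python sketch of a single SGD step; the callables `v_hat` and `grad_v_hat` are hypothetical names standing in for whatever differentiable approximator is chosen:

```python
def sgd_step(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One SGD step per Eq. (2): w <- w + alpha * (target - v_hat(s, w)) * grad_v_hat(s, w)."""
    error = v_target - v_hat(s, w)                 # prediction error for the sampled state
    return w + alpha * error * grad_v_hat(s, w)    # move w along the sampled gradient
```

With a linear approximator, `v_hat(s, w)` is `x(s) @ w` and `grad_v_hat(s, w)` is simply `x(s)`, which is exactly the linear case below.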
Linear Function Approximation
Feature vectors: represent a state by a feature vector

$$\mathbf{x}(S) = \begin{pmatrix} \mathbf{x}_1(S) \\ \vdots \\ \mathbf{x}_n(S) \end{pmatrix}$$
Linear: represent the value function by a linear combination of features:

$$\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S)\mathbf{w}_j = \mathbf{x}(S) \cdot \mathbf{w} \qquad \text{(dot product)} \tag{3}$$
It is easy to see that the gradient of a linear value function is just the feature vector:

$$\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)$$

Substituting into Eq. (2):

$$\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \mathbf{x}(S)$$

Update = step-size × prediction error × feature value
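As a minimal sketch (the feature vector and target below are made-up numbers for illustration), the linear update can be written directly:

```python
import numpy as np

def linear_update(w, x, v_target, alpha=0.1):
    """w <- w + alpha * (target - x^T w) * x, since the gradient of a linear v_hat is x(S)."""
    return w + alpha * (v_target - x @ w) * x

# toy usage with made-up numbers
w = np.zeros(3)
x_s = np.array([1.0, 0.5, -0.2])   # hypothetical feature vector x(S)
w = linear_update(w, x_s, v_target=2.0)
```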
Table lookup is in fact a special case of linear value function approximation, with feature vector

$$\mathbf{x}^{table}(S) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix}$$

Each element of $\mathbf{w}$ then stores the value of one state:

$$\hat{v}(S,\mathbf{w}) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix} \cdot \begin{pmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_n \end{pmatrix}$$
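A quick sketch (with a hypothetical 4-state example) showing that one-hot features reduce linear approximation to table lookup:

```python
import numpy as np

def one_hot(state_index, n_states):
    """Table-lookup feature vector: x_table(S) is 1 at the index of S and 0 elsewhere."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

w = np.array([0.0, 1.5, -0.3, 2.0])   # made-up values; w_i is the stored value of state s_i
x = one_hot(2, n_states=4)
print(x @ w)                           # -0.3, exactly the table entry for that state
```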
Incremental Prediction Algorithms
In RL there is no supervisor, i.e. no true value function $v_\pi(s)$. In practice, we substitute a target for $v_\pi(s)$:

- For MC, the target is the return $G_t$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \tag{4}$$
  - Forward view linear TD($\lambda$):
    $$\begin{aligned} \Delta \mathbf{w} &= \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \\ &= \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \mathbf{x}(S_t) \end{aligned} \tag{4.1}$$
  - Backward view linear TD($\lambda$), with eligibility trace $E_t$ (a code sketch follows this list):
    $$\begin{aligned} \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\ E_t &= \gamma \lambda E_{t-1} + \mathbf{x}(S_t) \\ \Delta \mathbf{w} &= \alpha \delta_t E_t \end{aligned} \tag{4.2}$$
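Below is a minimal sketch of backward-view linear TD($\lambda$) prediction per Eq. (4.2); the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`), the fixed `policy`, and the `features` function are assumptions made for illustration:

```python
import numpy as np

def td_lambda_prediction(env, policy, features, n_features,
                         n_episodes=100, alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view linear TD(lambda): TD error delta_t, eligibility trace E_t, weight update."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)                   # eligibility trace E_t, reset each episode
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else x_next @ w   # no bootstrapping from terminal states
            delta = r + gamma * v_next - x @ w     # delta_t = R_{t+1} + gamma*v_hat(S_{t+1}) - v_hat(S_t)
            e = gamma * lam * e + x                # E_t = gamma * lambda * E_{t-1} + x(S_t)
            w += alpha * delta * e                 # Delta w = alpha * delta_t * E_t
            s = s_next
    return w
```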
Control with Value Function Approximation
Action-Value Function Approximation
Analogous to (state-)value function approximation, approximate the action-value function:
$$\hat{q}(S,A,\mathbf{w}) \approx q_\pi(S,A)$$
$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w}))^2 \right]$$

$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[ (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S,A, \mathbf{w}) \right]$$
Linear Action-Value Function Approximation
Analogous to [linear function approximation](#linear-function-approximation), represent each state-action pair by a feature vector
$$\mathbf{x}(S,A) = \begin{pmatrix} \mathbf{x}_1(S,A) \\ \vdots \\ \mathbf{x}_n(S,A) \end{pmatrix}$$
$$\hat{q}(S,A, \mathbf{w}) = \mathbf{x}(S,A)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S,A)\mathbf{w}_j = \mathbf{x}(S,A) \cdot \mathbf{w} \qquad \text{(dot product)}$$

$$\Delta \mathbf{w} = \alpha (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w})) \mathbf{x}(S,A)$$
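One common way to construct $\mathbf{x}(S,A)$ from state features for a discrete action set is to give each action its own block of weights; this is an illustrative choice, not something the notes prescribe:

```python
import numpy as np

def sa_features(x_s, action, n_actions):
    """x(S, A): copy the state features into the block belonging to `action`, zeros elsewhere."""
    n = len(x_s)
    x_sa = np.zeros(n * n_actions)
    x_sa[action * n:(action + 1) * n] = x_s
    return x_sa

def q_hat(x_sa, w):
    """Linear action-value estimate: q_hat(S, A, w) = x(S, A)^T w."""
    return x_sa @ w
```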
Incremental Control Algorithms
Substitute a target for $q_\pi(S,A)$:
- For MC, the target is the return $G_t$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w})$ (a SARSA-style sketch with linear features follows this list):
  $$\Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w})} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $q_t^\lambda$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}q_t^\lambda} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
  - Backward view TD($\lambda$):
    $$\begin{aligned} \delta_t &= R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \\ E_t &= \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w}) \\ \Delta \mathbf{w} &= \alpha \delta_t E_t \end{aligned}$$
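Putting the pieces together, here is a rough sketch of SARSA(0)-style control with a linear $\hat{q}$ and an $\epsilon$-greedy policy; the environment interface and the feature constructors (`features_s`, `features_sa`) are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(w, x_s, n_actions, features_sa, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action under q_hat."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    q_values = [features_sa(x_s, a) @ w for a in range(n_actions)]
    return int(np.argmax(q_values))

def sarsa_linear(env, features_s, features_sa, n_weights, n_actions,
                 n_episodes=500, alpha=0.01, gamma=0.99, eps=0.1):
    """SARSA(0) with linear q_hat: w <- w + alpha * (R + gamma*q' - q) * x(S, A)."""
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, features_s(s), n_actions, features_sa, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x_sa = features_sa(features_s(s), a)
            if done:
                target = r                           # no bootstrapping at the end of an episode
            else:
                a_next = epsilon_greedy(w, features_s(s_next), n_actions, features_sa, eps)
                target = r + gamma * features_sa(features_s(s_next), a_next) @ w
            w += alpha * (target - x_sa @ w) * x_sa  # gradient of a linear q_hat is x(S, A)
            if not done:
                s, a = s_next, a_next
    return w
```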
Convergence of Prediction Algorithms

It is not entirely clear to me why plain TD can diverge while gradient TD converges, but the original slides explain it as follows:
TD does not follow the gradient of any objective function. This is why TD can diverge when off-policy or using non-linear function approximation. Gradient TD follows true gradient of projected Bellman error.
Convergence of Control Algorithms
Batch Methods
Put simply, the incremental control algorithms above update once per sample, which makes poor use of experience and produces strongly correlated updates.

Batch methods instead store experience in a buffer and repeatedly sample a random batch of data from it, performing SGD updates on each sampled batch. The objective is still the squared error between the target value and the estimated value. A sketch of such an experience-replay update follows.
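A minimal sketch of experience replay with minibatch SGD, using the same linear parameterisation as above; the buffer size, batch size, and the `(feature vector, target)` storage format are illustrative assumptions:

```python
import random
from collections import deque

import numpy as np

# the replay buffer stores experience as (feature vector x(S), target value) pairs
buffer = deque(maxlen=10_000)

def replay_update(w, buffer, batch_size=32, alpha=0.01):
    """Sample a random minibatch from the buffer and take one SGD step on the squared error."""
    if not buffer:
        return w
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    grad = np.zeros_like(w)
    for x, target in batch:
        grad += (target - x @ w) * x      # descent direction for 0.5 * (target - x^T w)^2
    return w + alpha * grad / len(batch)  # average the update over the minibatch
```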