Chapter 9 On-policy Prediction with Approximation


滑稽树


Article link: https://blog.csdn.net/dengyibing/article/details/80837134


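These are reading notes on Reinforcement Learning: An Introduction (Sutton and Barto).

The methods studied so far were all tabular. From here on, the table is replaced by an approximation.

Two themes carry through the rest of the book:
1. update target
2. update distribution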

9.1 Value-function Approximation

Machine learning methods that learn to mimic input-output examples in this way are called supervised learning methods, and when the outputs are numbers, like u, the process is often called function approximation

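Because learning the approximate value function is a supervised learning process, any supervised-learning function approximator can be used for it, for example a neural network.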

9.2 The Prediction Objective ( $\overline{VE}$ )

Mean Squared Value Error, denoted $\overline{VE}$:

$\overline{VE}(w) \doteq \sum_{s \in S} \mu(s)[v_{\pi}(s)-\hat{v}(s,w)]^2$
where

$\hat{v}(s,w)$ is the approximate value;

$v_{\pi}(s)$ is the true value;

$\mu(s)$ is the fraction of time spent in state s. Under on-policy training, $\mu(s)$ is called the on-policy distribution, which is the focus of this chapter.
The on-policy distribution in episodic tasks

For continuing tasks, the on-policy distribution is the stationary distribution under $\pi$.
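To make the objective concrete, here is a minimal sketch that evaluates $\overline{VE}$ for a finite state set. The three-state numbers are made up purely for illustration; in practice $v_{\pi}$ is unknown, so this quantity can only be computed in test problems where the true values are available.

```python
def mean_squared_value_error(mu, v_true, v_hat):
    """VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))**2 over a finite state set."""
    return sum(m * (vt - vh) ** 2 for m, vt, vh in zip(mu, v_true, v_hat))

# Hypothetical three-state example (all numbers are invented for illustration).
mu = [0.5, 0.3, 0.2]        # on-policy distribution, sums to 1
v_true = [1.0, 2.0, 3.0]    # v_pi(s)
v_hat = [1.0, 1.5, 4.0]     # v_hat(s, w) for some fixed w
ve = mean_squared_value_error(mu, v_true, v_hat)
print(ve)
```

Errors in frequently visited states count more: the error of 0.5 in the second state is weighted by 0.3, the error of 1.0 in the third state by 0.2.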

9.3 Stochastic-gradient and Semi-gradient Methods

In gradient-descent methods, the weight vector is a column vector $w \doteq (w_1, w_2, \cdots, w_d)^T$, and the approximate value function $\hat{v}(s, w)$ is a differentiable function of $w$.

Stochastic gradient-descent (SGD)

$\begin{align*} w_{t+1} & \doteq w_t - \frac{1}{2} \alpha \nabla [v_{\pi}(S_t) - \hat{v}(S_t, w_t)]^2 \\ & = w_t + \alpha[v_{\pi}(S_t)-\hat{v}(S_t, w_t)] \nabla \hat{v}(S_t, w_t) \end{align*}$
Because $v_{\pi}(S_t)$ is unknown, we substitute a target $U_t$ for it, ideally an unbiased estimate:

$w_{t+1} \doteq w_t+\alpha[U_t-\hat{v}(S_t, w_t)]\nabla \hat v(S_t, w_t)$
In Monte Carlo methods the target can be computed directly, $U_t \doteq G_t$, and it is an unbiased estimate of $v_{\pi}(S_t)$. The gradient-descent version of Monte Carlo state-value prediction is therefore guaranteed to converge to a local optimum, provided the step size decreases appropriately.
Gradient Monte Carlo Algorithm for Estimating
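The update $w \leftarrow w + \alpha[G_t - \hat v(S_t, w)]\nabla \hat v(S_t, w)$ can be sketched as follows. This is a minimal illustration, not the book's pseudocode: it assumes a hypothetical 5-state random walk (reward +1 on exiting to the right, 0 otherwise), one-hot features so that $\hat v(s,w) = w_s$, and a small constant step size rather than a decreasing one.

```python
import random

def gradient_mc(num_episodes=20000, alpha=0.001, gamma=1.0, n_states=5, seed=0):
    """Gradient Monte Carlo with a linear approximator and one-hot features."""
    rng = random.Random(seed)
    w = [0.0] * n_states                       # with one-hot x(s), v_hat(s, w) = w[s]
    for _ in range(num_episodes):
        s, states, rewards = n_states // 2, [], []
        while 0 <= s < n_states:               # generate one episode under the random policy
            s_next = s + rng.choice((-1, 1))
            states.append(s)
            rewards.append(1.0 if s_next == n_states else 0.0)
            s = s_next
        returns, g = [], 0.0
        for r in reversed(rewards):            # G_t = R_{t+1} + gamma * G_{t+1}
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for s_t, g_t in zip(states, returns):  # one SGD step per visited state
            w[s_t] += alpha * (g_t - w[s_t])   # gradient w.r.t. w[s_t] is 1
    return w

w = gradient_mc()
print([round(v, 2) for v in w])
```

For this walk the true values are 1/6, 2/6, ..., 5/6, so the learned weights should end up near those numbers, with some residual noise from the constant step size.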

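With bootstrapping, this guarantee no longer holds.
Bootstrapping does not yield a true gradient-descent method, because targets such as the n-step return $G_{t:t+n}$ or the DP target $\sum_{a,s^\prime, r}\pi(a|S_t)p(s^\prime,r|S_t,a)[r+\gamma \hat v(s^\prime, w_t)]$ depend on the current value of the weight vector $w_t$. The targets are therefore biased, and the update accounts for the effect of changing $w_t$ on the estimate while ignoring its effect on the target.

Because they include only part of the gradient, these methods are called semi-gradient methods.

Although they do not converge as robustly as true gradient methods, semi-gradient methods typically learn significantly faster, and they allow learning to be continual and online.

Semi-gradient TD(0) uses $U_t \doteq R_{t+1}+\gamma \hat v(S_{t+1},w_t)$ as its target.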
Semi-gradient TD(0) for estimating

State aggregation

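State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector w) per group.
State aggregation is a special case of SGD in which the gradient $\nabla \hat v(S_t, w_t)$ is 1 for the component of the group containing $S_t$ and 0 for all other components.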
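Semi-gradient TD(0) and state aggregation combine naturally, as in the sketch below. The environment is an assumption made for illustration only: a 6-state random walk with uniform random start, reward -1 on exiting left and +1 on exiting right, and states aggregated into three groups of two.

```python
import random

def semi_gradient_td0(num_episodes=5000, alpha=0.01, gamma=1.0,
                      n_states=6, group_size=2, seed=0):
    """Semi-gradient TD(0) with state aggregation on a hypothetical random walk."""
    rng = random.Random(seed)
    w = [0.0] * (n_states // group_size)          # one weight per group of states
    for _ in range(num_episodes):
        s = rng.randrange(n_states)               # assumed: uniform random start state
        while 0 <= s < n_states:
            s_next = s + rng.choice((-1, 1))      # random policy: step left or right
            if s_next < 0:
                r, v_next = -1.0, 0.0             # left exit, terminal value is 0
            elif s_next >= n_states:
                r, v_next = 1.0, 0.0              # right exit, terminal value is 0
            else:
                r, v_next = 0.0, w[s_next // group_size]
            g = s // group_size                   # the only component with gradient 1
            w[g] += alpha * (r + gamma * v_next - w[g])
            s = s_next
    return w

w = semi_gradient_td0()
print([round(v, 2) for v in w])
```

The target `r + gamma * v_next` is treated as a constant when the weight is updated, which is exactly what makes the method semi-gradient. By symmetry of this walk, the middle group should settle near 0, the left group below it, and the right group above it.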

9.4 Linear Methods

Linear methods approximate the value function with $\hat v(s, w)$ linear in the weight vector, where $x(s)$ is the feature vector representing state s:

$\hat v(s,w) \doteq w^T x(s) \doteq \sum_{i=1}^d w_ix_i(s)$

SGD

$w_{t+1} \doteq w_t + \alpha[U_t - \hat v(S_t,w_t)]x(S_t)$

semi-gradient TD(0)

$\begin{align*} w_{t+1} & \doteq w_t + \alpha (R_{t+1}+\gamma w_t^Tx_{t+1}-w_t^Tx_t)x_t\\ & = w_t + \alpha(R_{t+1}x_t-x_t(x_t-\gamma x_{t+1})^Tw_t) \end{align*}$

Once the system has reached steady state, the expected next weight vector for any given $w_t$ can be written

$\mathbb{E}[w_{t+1}|w_t] = w_t+\alpha(b-Aw_t)$

where

$\begin{align*} b & \doteq \mathbb{E}[R_{t+1}x_t] \in \mathbb{R}^d \\ A & \doteq \mathbb{E}[x_t(x_t-\gamma x_{t+1})^T] \in \mathbb{R}^{d \times d} \end{align*}$

If the updates converge, the expected change must vanish, so they converge to the weight vector $w_{TD}$ at which

$\begin{align*} && b - Aw_{TD} & = 0 \\ \Rightarrow && b &= Aw_{TD} \\ \Rightarrow && w_{TD} & \doteq A^{-1}b \end{align*}$
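The fixed point can be checked by hand on a tiny example. The sketch below assumes a hypothetical two-state chain that alternates deterministically (state 0 to state 1 with reward 1, state 1 to state 0 with reward 0), one-hot features, $\gamma = 0.5$, and steady-state probability 0.5 for each state. It builds $A$ and $b$ from their definitions and solves the 2x2 system in closed form.

```python
def td_fixed_point_2d(transitions, gamma):
    """Solve A w = b for d = 2, with A = E[x_t (x_t - gamma x_{t+1})^T], b = E[R_{t+1} x_t].

    transitions: list of (probability, x_t, reward, x_next) tuples.
    """
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for p, x, r, x_next in transitions:
        diff = [x[j] - gamma * x_next[j] for j in range(2)]
        for i in range(2):
            b[i] += p * r * x[i]
            for j in range(2):
                A[i][j] += p * x[i] * diff[j]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # Closed-form inverse of a 2x2 matrix applied to b.
    w = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - A[1][0] * b[0]) / det]
    return A, b, w

# Hypothetical chain: (probability, x_t, R_{t+1}, x_{t+1}).
chain = [(0.5, [1.0, 0.0], 1.0, [0.0, 1.0]),
         (0.5, [0.0, 1.0], 0.0, [1.0, 0.0])]
A, b, w_td = td_fixed_point_2d(chain, gamma=0.5)
print(w_td)
```

With one-hot features the approximation is exact, so $w_{TD}$ equals the true values: $v(0) = 1 + 0.5\,v(1)$ and $v(1) = 0.5\,v(0)$ give $v(0) = 4/3$ and $v(1) = 2/3$.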

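Linear semi-gradient TD(0) is proven to converge to the TD fixed point; the book gives the detailed proof. Moreover, in the continuing case, $\overline{VE}$ at the TD fixed point is proven to be within a bounded expansion of the lowest possible error:

$\overline{VE}(w_{TD}) \le \frac{1}{1-\gamma} \min_{w} \overline{VE}(w)$

One-step semi-gradient action-value methods, such as semi-gradient Sarsa(0), converge to an analogous fixed point with an analogous bound.

These guarantees depend on updating according to the on-policy distribution. For other update distributions, bootstrapping methods that use function approximation may actually diverge to infinity.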

n-step semi-gradient TD for estimating

$w_{t+n} \doteq w_{t+n-1}+\alpha[G_{t:t+n}-\hat{v}(S_t,w_{t+n-1})]\nabla \hat v(S_t, w_{t+n-1}), \qquad 0 \le t < T,$

$G_{t:t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1}R_{t+n}+\gamma^n \hat{v}(S_{t+n},w_{t+n-1}), \qquad 0 \le t \le T-n$

9.5 Feature Construction for Linear Methods

A limitation of the linear form is that it cannot take into account any interactions between features, such as the presence of feature i being good only in the absence of feature j.

9.5.1 Polynomials

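Polynomial features approximate the value function with polynomials of the state variables. For a state with two numerical dimensions $s_1$ and $s_2$, one possible feature vector is

$x(s) = (1, s_1, s_2, s_1s_2, s_1^2, s_2^2, s_1s_2^2, s_1^2s_2, s_1^2s_2^2)^T$
Extending from two state variables to k, each feature can be written as

$x_i(s) = \prod_{j=1}^k s_j^{c_{i,j}}$
where each exponent $c_{i,j}$ is an integer in $\{0, 1, \cdots, n\}$. These features make up the order-n polynomial basis for dimension k, which contains $(n+1)^k$ different features.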
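A minimal sketch of the order-n polynomial basis follows. The function name and the enumeration order of the exponent vectors are choices made here for illustration, so the features come out in a different order than in the two-variable example above.

```python
from itertools import product

def polynomial_features(s, n):
    """Order-n polynomial basis: one feature prod_j s_j**c_j per exponent vector c in {0..n}^k."""
    feats = []
    for c in product(range(n + 1), repeat=len(s)):
        value = 1.0
        for s_j, c_j in zip(s, c):
            value *= s_j ** c_j
        feats.append(value)
    return feats

x = polynomial_features([2.0, 3.0], n=2)
print(len(x), x)
```

For k = 2 and n = 2 this yields $(2+1)^2 = 9$ features, the same nine monomials as in the example vector.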

9.5.2 Fourier Basis

Assume each state $s = (s_1, s_2, \cdots, s_k)^T$ has every $s_j \in [0, 1]$. The $i$th feature of the order-n Fourier cosine basis is

$x_i(s) = \cos(\pi s^T c^i)$

where $c^i=(c_1^i, \cdots, c_k^i)^T$, with $c_j^i \in \{0, \cdots, n\}$ for $j = 1, \cdots, k$ and $i = 1, \cdots ,(n+1)^k$
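The Fourier cosine features can be generated in the same way as the polynomial ones, by enumerating all integer vectors $c^i$. This is a sketch under the assumption that the state has already been scaled into the unit hypercube; the function name is made up for illustration.

```python
import math
from itertools import product

def fourier_features(s, n):
    """Order-n Fourier cosine basis: x_i(s) = cos(pi * s^T c^i) for every c^i in {0..n}^k."""
    return [math.cos(math.pi * sum(s_j * c_j for s_j, c_j in zip(s, c)))
            for c in product(range(n + 1), repeat=len(s))]

x = fourier_features([0.5, 1.0], n=1)
print(x)
```

The feature for $c^i = (0, \cdots, 0)$ is always 1 and acts as a constant (bias) term.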

9.5.3 Coarse Coding

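Coarse coding represents a state with features that overlap. Picture each feature as a circle in state space: every circle that contains the state is a binary feature with value 1 and its own weight, and the circles differ in how they overlap.
Coarse coding
Because a state (a point in the space) lies inside many circles, its estimate is affected by all of those features. Training at one state updates the weights of every circle that contains it, so learning generalizes to all other states inside those circles.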

9.5.4 Tile Coding

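As the figure below shows, each such partition of the state space is called a tiling, and each element of a partition is called a tile.
With a single tiling we get nothing more than state aggregation; the benefits of coarse coding require multiple tilings. As the figure shows, each tiling is offset from the others. Because exactly n features are active at any state when there are n tilings, a natural choice of step size is $\alpha=1/n$
Multiple, overlapping grid-tilings
One problem remains: when all tilings are offset uniformly (by the same amount in every dimension), artifacts appear in the generalization, so asymmetric offsets are used to avoid them. Here w denotes the tile width, n the number of tilings, and w/n the fundamental unit of offset.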

In particular, for a continuous space of dimension k, a good choice is to use the first odd integers (1,3,5,7,…,2k-1), with n (the number of tilings) set to an integer power of 2 greater than or equal to 4k.

Why tile asymmetrical offsets are preferred
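A bare-bones tile coder with asymmetric offsets might look like the sketch below. It is an illustration under stated assumptions, not the tile-coding software that accompanies the book: the state is two-dimensional, the displacement vector is (1, 3), there are 8 tilings (a power of 2 that is at least 4k for k = 2), and tiles are identified by tuples instead of being hashed into a weight-vector index.

```python
import math

def active_tiles(s, tile_width=1.0, n_tilings=8, displacement=(1, 3)):
    """Return one (tiling index, tile coordinates...) tuple per tiling for state s."""
    unit = tile_width / n_tilings                  # fundamental unit w/n
    tiles = []
    for i in range(n_tilings):
        # Tiling i is shifted by i * displacement[j] * unit along dimension j.
        coords = tuple(math.floor((s_j + i * d_j * unit) / tile_width)
                       for s_j, d_j in zip(s, displacement))
        tiles.append((i,) + coords)
    return tiles

a = set(active_tiles((0.30, 0.70)))
b = set(active_tiles((0.45, 0.70)))   # a nearby state
c = set(active_tiles((5.30, 5.70)))   # a distant state
print(len(a & b), len(a & c))
```

Each state activates exactly one tile per tiling. Nearby states share most of their active tiles and therefore generalize to each other, while distant states share none.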

9.5.5 Radial Basis Functions

Radial basis functions (RBFs)

$x_i(s) \doteq \exp(-\frac{||s-c_i||^2}{2\sigma_i^2})$

where $c_i$ is the center of feature i and $\sigma_i$ is its width.

One-dimensional radial basis functions
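Here is a small sketch that evaluates Gaussian RBF features. The centers and widths are arbitrary values chosen for illustration, and the function name is hypothetical.

```python
import math

def rbf_features(s, centers, sigmas):
    """x_i(s) = exp(-||s - c_i||^2 / (2 * sigma_i^2)) for each (center, width) pair."""
    feats = []
    for c, sigma in zip(centers, sigmas):
        sq_dist = sum((s_j - c_j) ** 2 for s_j, c_j in zip(s, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats

x = rbf_features([0.0], centers=[[0.0], [1.0], [2.0]], sigmas=[1.0, 1.0, 1.0])
print(x)
```

Unlike the binary features of coarse coding, each feature takes a value in (0, 1] that falls off smoothly with distance from its center.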

9.6 Selecting Step-Size Parameters Manually

A good rule of thumb for setting the step-size parameter of linear SGD methods is then

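$\alpha \doteq (\tau \mathbb{E}[x^Tx])^{-1}$

where x is a random feature vector drawn from the same distribution as the inputs to SGD, and $\tau$ is the number of experiences with substantially the same feature vector within which we want learning to occur.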
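As a quick worked example of this rule of thumb, the sketch below estimates $\mathbb{E}[x^Tx]$ from a sample of feature vectors. The sample is hypothetical: binary tile-coding features with 8 tilings, so every vector contains exactly eight ones and $x^Tx = 8$.

```python
def rule_of_thumb_alpha(feature_vectors, tau):
    """alpha = 1 / (tau * E[x^T x]), with the expectation estimated by a sample mean."""
    mean_sq_norm = sum(sum(x_i * x_i for x_i in x) for x in feature_vectors) / len(feature_vectors)
    return 1.0 / (tau * mean_sq_norm)

# Hypothetical sample: 100 binary feature vectors of length 32, each with 8 active features.
samples = [[1.0] * 8 + [0.0] * 24 for _ in range(100)]
alpha = rule_of_thumb_alpha(samples, tau=10)
print(alpha)
```

With $\tau = 10$ this gives $\alpha = 1/(10 \cdot 8) = 0.0125$, while $\tau = 1$ would recover the one-trial setting $\alpha = 1/n$ for n tilings.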

9.7 Nonlinear Function Approximation: Artificial Neural Networks

Artificial neural networks can be used as nonlinear function approximators.

9.8 Least-Squares TD

The methods above are all iterative. Least-Squares TD (LSTD) instead computes the solution directly in closed form, and it makes more efficient use of data.

LSTD estimates A and b and then computes $w_t$ in one step. It has the same fixed point as the TD(0) method above.

Least-Squares TD algorithm

$\hat{A}_t \doteq \sum_{k=0}^{t-1}x_k(x_k-\gamma x_{k+1})^T+\varepsilon I\\ \hat{b}_t \doteq \sum_{k=0}^{t-1} R_{k+1}x_k$

which gives

$w_t \doteq \hat{A}_t^{-1} \hat{b}_t$

However, LSTD is computationally expensive. Inverting $\hat{A}_t$ directly takes $O(d^3)$ time. Because $\hat{A}_t$ is a sum of outer products, its inverse can instead be updated incrementally (the Sherman-Morrison formula), which reduces the per-step cost to $O(d^2)$:

$\begin{align*} \hat{A}_t^{-1} &= (\hat{A}_{t-1}+x_{t-1}(x_{t-1}-\gamma x_t)^T)^{-1}\\ & = \hat{A}_{t-1}^{-1}-\frac{\hat{A}_{t-1}^{-1}x_{t-1}(x_{t-1}-\gamma x_t)^T\hat{A}_{t-1}^{-1}}{1+(x_{t-1}-\gamma x_t)^T \hat{A}_{t-1}^{-1}x_{t-1}} \end{align*}$

LSTD for estimating
all with only $O(d^2)$ memory and per-step computation
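The incremental inverse update can be sketched in plain Python as below. This is a minimal illustration rather than the book's pseudocode: feature vectors are lists, the function name and the `(x, reward, x_next)` transition format are assumptions, and the data comes from a hypothetical two-state chain that alternates deterministically (reward 1 when leaving state 0, reward 0 when leaving state 1) with $\gamma = 0.5$.

```python
def lstd(transitions, d, gamma, epsilon=1e-3):
    """LSTD with a Sherman-Morrison update of the inverse of A_hat.

    transitions: iterable of (x, reward, x_next), feature vectors as lists of length d.
    """
    # Start from the inverse of epsilon * I.
    inv_A = [[(1.0 / epsilon if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, r, x_next in transitions:
        diff = [x[j] - gamma * x_next[j] for j in range(d)]
        v = [sum(inv_A[i][j] * x[j] for j in range(d)) for i in range(d)]      # inv_A x
        u = [sum(diff[i] * inv_A[i][j] for i in range(d)) for j in range(d)]   # diff^T inv_A
        denom = 1.0 + sum(diff[i] * v[i] for i in range(d))
        for i in range(d):
            for j in range(d):
                inv_A[i][j] -= v[i] * u[j] / denom    # rank-one correction, O(d^2)
            b[i] += r * x[i]
    return [sum(inv_A[i][j] * b[j] for j in range(d)) for i in range(d)]

e0, e1 = [1.0, 0.0], [0.0, 1.0]                       # one-hot features of the two states
data = [(e0, 1.0, e1), (e1, 0.0, e0)] * 500
w = lstd(data, d=2, gamma=0.5)
print(w)
```

No matrix is ever inverted explicitly; each step costs $O(d^2)$. For this chain the result should be close to the TD fixed point (4/3, 2/3), slightly perturbed by the $\varepsilon I$ regularizer.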

9.9 Memory-based Function Approximation

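Memory-based methods are nonparametric: training examples are simply saved in memory. When a query state needs a value estimate, a set of examples is retrieved from memory and used to compute it. This approach is also called lazy learning. The focus here is on local learning, which estimates the value of a query state using only its neighbors in memory. Examples include the nearest neighbor method, weighted average methods, and locally weighted regression. The main advantage is that the approximation is not limited to a pre-specified functional form. The main difficulty is answering nearest-neighbor queries quickly, for which a k-d tree can be used.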

9.10 Kernel-based Function Approximation

A kernel function assigns weights to the states in memory according to their distance from the query state. Kernel methods are still memory-based methods.
Kernel regression

$\hat{v}(s,D) = \sum_{s^\prime \in D} k(s,s^\prime)g(s^\prime)$
D is the set of stored examples
g(s’) denotes the target for state s’ in a stored example
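Kernel regression can be sketched in a few lines. The choices below are assumptions made for illustration: states are scalars, and $k(s, s')$ is a Gaussian of the distance, normalized so that the weights over the stored examples sum to 1 (which turns the sum into a weighted average of the stored targets).

```python
import math

def kernel_regression(s, examples, sigma=1.0):
    """v_hat(s, D) = sum_{s'} k(s, s') g(s'), with normalized Gaussian weights as k."""
    raw = [math.exp(-(s - s_p) ** 2 / (2.0 * sigma ** 2)) for s_p, _ in examples]
    total = sum(raw)
    return sum(k / total * g for k, (_, g) in zip(raw, examples))

D = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]   # hypothetical stored (state, target) pairs
v = kernel_regression(1.0, D, sigma=0.5)
print(v)
```

The estimate at s = 1 is dominated by the stored example at that state and pulled slightly upward by the larger target at s' = 2. A smaller `sigma` makes the estimate more local.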

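Linear parametric regression can also be viewed as a kernel method, with the kernel given by the inner product of the feature vectors:

$k(s,s^\prime) = x(s)^Tx(s^\prime)$

The "kernel trick": it makes it possible to work effectively in the high-dimensional space of an expanded feature representation while actually working only with the set of stored training examples.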

9.11 Looking Deeper at On-policy Learning: Interest and Emphasis

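The algorithms so far treat all encountered states as equally important. In practice we often care more about some states than others, so different states should receive different degrees of attention.

Interest $I_t$: a non-negative scalar random variable indicating how much we care about accurately valuing the state at time t.
Emphasis $M_t$: a non-negative scalar random variable that multiplies the learning update and so emphasizes or de-emphasizes the learning done at time t.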

The general n-step learning rule

$w_{t+n} \doteq w_{t+n-1}+\alpha M_t[G_{t:t+n}-\hat{v}(S_t,w_{t+n-1})]\nabla \hat v(S_t, w_{t+n-1}), \qquad 0 \le t < T$

$M_t = I_t+\gamma^n M_{t-n}, \qquad 0 \le t < T$

with $M_t \doteq 0$ for all $t < 0$.
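The emphasis recursion is easy to trace by hand. The sketch below computes $M_t$ for a given interest sequence; the function name and the example values (interest only in the first state, $\gamma = 0.5$, n = 1) are assumptions for illustration.

```python
def emphasis_sequence(interests, gamma, n):
    """M_t = I_t + gamma**n * M_{t-n}, with M_t = 0 for t < 0."""
    M = []
    for t, interest in enumerate(interests):
        prev = M[t - n] if t - n >= 0 else 0.0
        M.append(interest + gamma ** n * prev)
    return M

M = emphasis_sequence([1.0, 0.0, 0.0, 0.0], gamma=0.5, n=1)
print(M)
```

Even though only the first state has nonzero interest, later states still receive some emphasis, because the first state's target bootstraps from their estimates. The emphasis decays by a factor of $\gamma^n$ every n steps.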