Reinforcement Learning: An Introduction, Notes on Chapter 9: On-policy Prediction with Approximation

The novelty of this chapter is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^{d}$.

What function approximation cannot do, however, is augment the state representation with memories of past observations.

One way this differs from tabular methods:
When a single state is updated, the change generalizes from that state to affect the values of many other states. Such generalization makes the learning potentially more powerful but also potentially more difficult to manage and understand. With tabular methods, the learned values at each state were decoupled: an update at one state affected no other.

'Update' notation: $s \mapsto u$, where $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.

When the update target itself depends on the current weight vector, as with bootstrapping targets, only part of the true gradient is taken into account; such methods are called semi-gradient methods and are discussed below.

Prediction Objective

A natural objective function is the Mean Squared Value Error, denoted $\overline{VE}$:
$\overline{VE}(\mathbf{w}) \doteq \sum_{s\in S}\mu(s)\,[v_{\pi}(s) - \hat{v}(s, \mathbf{w})]^{2}$
where
$\mu(s)$ ------ a state weighting or distribution, with $\mu(s) \geq 0$ and $\sum_{s\in S}\mu(s) = 1$, specifying how much we care about the error in each state $s$
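As a concrete illustration, here is a minimal sketch (a made-up example, not from the book) of computing $\overline{VE}$: the three states, the weighting $\mu$, the true values $v_\pi$, and the simple two-parameter approximator $\hat{v}(s, \mathbf{w}) = w_0 + w_1 s$ are all assumptions chosen only for illustration.

```python
import numpy as np

def v_hat(s, w):
    # Hypothetical two-parameter approximator: v_hat(s, w) = w0 + w1 * s.
    return w[0] + w[1] * s

def ve_bar(w, states, mu, v_pi):
    # Mean Squared Value Error: sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2.
    return sum(mu[s] * (v_pi[s] - v_hat(s, w)) ** 2 for s in states)

states = [0, 1, 2]
mu = {0: 0.5, 1: 0.3, 2: 0.2}       # assumed state weighting, sums to 1
v_pi = {0: 1.0, 1: 2.0, 2: 3.0}     # assumed true values under pi
w = np.array([1.0, 0.9])

print(ve_bar(w, states, mu, v_pi))  # 0.011: the weighted squared error of this w
```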

Stochastic-gradient and Semi-gradient Methods

In particular, there is generally no $\mathbf{w}$ that gets all the states, or even all the examples, exactly correct. In addition, we must generalize to all the other states that have not appeared in examples.

Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example (we assume that states appear in examples with the same distribution $\mu$):
$\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \frac{1}{2}\alpha \nabla\big[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\big]^2 = \mathbf{w}_t + \alpha\big[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\big]\nabla\hat{v}(S_t, \mathbf{w}_t)$

where
$\alpha$ ------ a positive step-size parameter
$\nabla f(\mathbf{w})$ ------ the column vector of partial derivatives of a scalar expression $f(\mathbf{w})$ with respect to the components of the weight vector

Gradient descent methods are called "stochastic" when the update is done, as here, on only a single example, which might have been selected stochastically. Over many examples, taking small steps, the overall effect is to minimize an average performance measure such as $\overline{VE}$.
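The update rule can be sketched in code. The following assumes the gradient Monte Carlo setting, where each example is a pair $(S_t, G_t)$ and the return $G_t$ serves as an unbiased sample of $v_\pi(S_t)$; the two-parameter approximator and the example data are assumptions for illustration only.

```python
import numpy as np

def v_hat(s, w):
    # Hypothetical approximator v_hat(s, w) = w0 + w1 * s.
    return w[0] + w[1] * s

def grad_v_hat(s, w):
    # Gradient of v_hat with respect to w: (1, s).
    return np.array([1.0, float(s)])

def sgd_update(w, s, g, alpha):
    # w_{t+1} = w_t + alpha * [G_t - v_hat(S_t, w_t)] * grad v_hat(S_t, w_t)
    return w + alpha * (g - v_hat(s, w)) * grad_v_hat(s, w)

w = np.zeros(2)
alpha = 0.01
examples = [(0, 1.2), (1, 2.1), (2, 2.9)] * 200   # made-up (state, return) pairs
for s, g in examples:
    w = sgd_update(w, s, g, alpha)

print(w)  # drifts toward the least-squares fit of the (state, return) pairs
```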

Linear Methods

Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:
$\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^{\top}\mathbf{x}(s) \doteq \sum_{i=1}^{d} w_i x_i(s)$
where $\mathbf{x}(s) \doteq (x_1(s), x_2(s), \ldots, x_d(s))^{\top}$ is called the feature vector representing state $s$. In the linear case the gradient of the approximate value function is simply $\nabla\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$.
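Since the gradient is just the feature vector, the general SGD step reduces to $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,[U_t - \mathbf{w}^{\top}\mathbf{x}(S_t)]\,\mathbf{x}(S_t)$ for whatever target $U_t$ is used. A minimal sketch, assuming made-up feature vectors and target:

```python
import numpy as np

def v_hat(x, w):
    # Linear approximation: v_hat(s, w) = w^T x(s).
    return w @ x

def linear_update(w, x, target, alpha):
    # w <- w + alpha * [U_t - w^T x(S_t)] * x(S_t), since grad v_hat = x(s).
    return w + alpha * (target - v_hat(x, w)) * x

d = 4
w = np.zeros(d)
x_s = np.array([1.0, 0.0, 0.5, 0.0])   # hypothetical feature vector x(S_t)
u_t = 2.0                              # some update target (e.g. a return)

w = linear_update(w, x_s, u_t, alpha=0.1)
print(v_hat(x_s, w))                   # estimate moves toward the target
```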

Feature Construction for Linear Methods

Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.
Compared with nonlinear approximation, an advantage of linear approximation is that there is only a single optimum, so methods that converge to or near a local optimum are guaranteed to converge to or near the global optimum.
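One common construction discussed in the chapter is polynomial features. Below is a minimal sketch, assuming a low-dimensional state whose components have been scaled to $[0, 1]$; the example state and the chosen order are arbitrary.

```python
import numpy as np
from itertools import product

def polynomial_features(state, order):
    # x_i(s) = prod_j s_j^{c_ij}, with each exponent c_ij in {0, ..., order}.
    exponents = product(range(order + 1), repeat=len(state))
    return np.array([np.prod([s_j ** c_j for s_j, c_j in zip(state, c)])
                     for c in exponents])

s = np.array([0.2, 0.7])            # hypothetical 2-dimensional state in [0, 1]^2
x = polynomial_features(s, order=2)
print(x.shape)                      # (order + 1)^k = 9 features, usable as x(s)
```

The number of features grows as $(n+1)^k$ with the order $n$ and the state dimension $k$, which is one reason coarser constructions such as tile coding are often used for higher-dimensional states.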
