Reinforcement Learning: An Introduction, Notes on Chapter 9: On-policy Prediction with Approximation

The novelty of this chapter is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^{d}$.

What function approximation cannot do, however, is augment the state representation with memories of past observations.

One way this differs from tabular methods:
When a single state is updated, the change generalizes from that state to affect the values of many other states. Such generalization makes the learning potentially more powerful but also potentially more difficult to manage and understand. With tabular methods, the learned values at each state were decoupled: an update at one state affected no other.

'Update' notation: $s \mapsto u$, where $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.

When the update target itself depends on the current weight vector, as with bootstrapping targets, only part of the true gradient is taken into account; such methods are called semi-gradient methods and are discussed below.

Prediction Objective

A natural objective function is the Mean Squared Value Error, denoted $\overline{VE}$:
$\overline{VE}(\mathbf{w}) \doteq \sum_{s\in S}\mu(s)\,[v_{\pi}(s) - \hat{v}(s, \mathbf{w})]^{2}$
where
$\mu(s)$ ------ a state weighting or distribution, with $\mu(s) \geq 0$ and $\sum_{s\in S}\mu(s) = 1$, specifying how much we care about the error in each state $s$
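As a concrete illustration, here is a minimal sketch (a made-up example, not from the book) of computing $\overline{VE}$: the three states, the weighting $\mu$, the true values $v_\pi$, and the simple two-parameter approximator $\hat{v}(s, \mathbf{w}) = w_0 + w_1 s$ are all assumptions chosen only for illustration.

```python
import numpy as np

def v_hat(s, w):
    # Hypothetical two-parameter approximator: v_hat(s, w) = w0 + w1 * s.
    return w[0] + w[1] * s

def ve_bar(w, states, mu, v_pi):
    # Mean Squared Value Error: sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2.
    return sum(mu[s] * (v_pi[s] - v_hat(s, w)) ** 2 for s in states)

states = [0, 1, 2]
mu = {0: 0.5, 1: 0.3, 2: 0.2}       # assumed state weighting, sums to 1
v_pi = {0: 1.0, 1: 2.0, 2: 3.0}     # assumed true values under pi
w = np.array([1.0, 0.9])

print(ve_bar(w, states, mu, v_pi))  # 0.011: the weighted squared error of this w
```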

Stochastic-gradient and Semi-gradient Methods

In particular, there is generally no $\mathbf{w}$ that gets all the states, or even all the examples, exactly correct. In addition, we must generalize to all the other states that have not appeared in examples.

Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example (we assume that states appear in examples with the same distribution $\mu$):
$\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \frac{1}{2}\alpha \nabla\big[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\big]^2 = \mathbf{w}_t + \alpha\big[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\big]\nabla\hat{v}(S_t, \mathbf{w}_t)$

where
$\alpha$ ------ a positive step-size parameter
$\nabla f(\mathbf{w})$ ------ the column vector of partial derivatives of a scalar expression $f(\mathbf{w})$ with respect to the components of the weight vector

Gradient descent methods are called "stochastic" when the update is done, as here, on only a single example, which might have been selected stochastically. Over many examples, taking small steps, the overall effect is to minimize an average performance measure such as $\overline{VE}$.
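The update rule can be sketched in code. The following assumes the gradient Monte Carlo setting, where each example is a pair $(S_t, G_t)$ and the return $G_t$ serves as an unbiased sample of $v_\pi(S_t)$; the two-parameter approximator and the example data are assumptions for illustration only.

```python
import numpy as np

def v_hat(s, w):
    # Hypothetical approximator v_hat(s, w) = w0 + w1 * s.
    return w[0] + w[1] * s

def grad_v_hat(s, w):
    # Gradient of v_hat with respect to w: (1, s).
    return np.array([1.0, float(s)])

def sgd_update(w, s, g, alpha):
    # w_{t+1} = w_t + alpha * [G_t - v_hat(S_t, w_t)] * grad v_hat(S_t, w_t)
    return w + alpha * (g - v_hat(s, w)) * grad_v_hat(s, w)

w = np.zeros(2)
alpha = 0.01
examples = [(0, 1.2), (1, 2.1), (2, 2.9)] * 200   # made-up (state, return) pairs
for s, g in examples:
    w = sgd_update(w, s, g, alpha)

print(w)  # drifts toward the least-squares fit of the (state, return) pairs
```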

Linear Methods

Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:
$\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^{\top}\mathbf{x}(s) \doteq \sum_{i=1}^{d} w_i x_i(s)$
where $\mathbf{x}(s) \doteq (x_1(s), x_2(s), \ldots, x_d(s))^{\top}$ is called the feature vector representing state $s$. In the linear case the gradient of the approximate value function is simply $\nabla\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$.
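Since the gradient is just the feature vector, the general SGD step reduces to $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,[U_t - \mathbf{w}^{\top}\mathbf{x}(S_t)]\,\mathbf{x}(S_t)$ for whatever target $U_t$ is used. A minimal sketch, assuming made-up feature vectors and target:

```python
import numpy as np

def v_hat(x, w):
    # Linear approximation: v_hat(s, w) = w^T x(s).
    return w @ x

def linear_update(w, x, target, alpha):
    # w <- w + alpha * [U_t - w^T x(S_t)] * x(S_t), since grad v_hat = x(s).
    return w + alpha * (target - v_hat(x, w)) * x

d = 4
w = np.zeros(d)
x_s = np.array([1.0, 0.0, 0.5, 0.0])   # hypothetical feature vector x(S_t)
u_t = 2.0                              # some update target (e.g. a return)

w = linear_update(w, x_s, u_t, alpha=0.1)
print(v_hat(x_s, w))                   # estimate moves toward the target
```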

Feature Construction for Linear Methods

Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.
Compared with nonlinear approximation, an advantage of linear approximation is that there is only a single optimum, so methods that converge to or near a local optimum are guaranteed to converge to or near the global optimum.
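One common construction discussed in the chapter is polynomial features. Below is a minimal sketch, assuming a low-dimensional state whose components have been scaled to $[0, 1]$; the example state and the chosen order are arbitrary.

```python
import numpy as np
from itertools import product

def polynomial_features(state, order):
    # x_i(s) = prod_j s_j^{c_ij}, with each exponent c_ij in {0, ..., order}.
    exponents = product(range(order + 1), repeat=len(state))
    return np.array([np.prod([s_j ** c_j for s_j, c_j in zip(state, c)])
                     for c in exponents])

s = np.array([0.2, 0.7])            # hypothetical 2-dimensional state in [0, 1]^2
x = polynomial_features(s, order=2)
print(x.shape)                      # (order + 1)^k = 9 features, usable as x(s)
```

The number of features grows as $(n+1)^k$ with the order $n$ and the state dimension $k$, which is one reason coarser constructions such as tile coding are often used for higher-dimensional states.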
