Lect6_Value_Function_Approximation

Value Function Approximation

Introduction

Why do we need it?

  • So far we have represented the value function by a lookup table
    • Every state $s$ has an entry $V(s)$
    • Or every state-action pair $(s,a)$ has an entry $Q(s,a)$
  • Problem with large MDPs:
    • There are too many states and/or actions to store in memory
    • It is too slow to learn the value of each state individually

Solution for large MDPs:

  • Estimate the value function with function approximation
    $$\hat{v}(s,\mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)$$
  • Generalise from seen states to unseen states
  • Update the parameter $\mathbf{w}$ using MC or TD learning

Types of Value Function Approximation

(Figure from the slides: three architectures for value function approximation: approximating the state value $\hat{v}(s,\mathbf{w})$; approximating the action value $\hat{q}(s,a,\mathbf{w})$ with the action as an input; or outputting all action values $\hat{q}(s,a_1,\mathbf{w}),\dots,\hat{q}(s,a_m,\mathbf{w})$ at once.)

Which Function Approximator?

There are many function approximators, e.g.

  • Linear combinations of features
  • Neural network
  • Decision tree
  • Nearest neighbor
  • Fourier / wavelet bases
  • …

Here we consider differentiable function approximators, in particular linear combinations of features and neural networks.

Incremental Methods

Value Function Approx. by SGD

Goal: find the parameter vector $\mathbf{w}$ minimising the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)^2\right] \tag{1}$$

Gradient descent finds a local minimum:

$$\Delta\mathbf{w} = -\frac{1}{2}\alpha\nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\,{\color{red}\mathbb{E}_\pi}\left[\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})\right]$$

Stochastic gradient descent samples the gradient:

$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \tag{2}$$

The expected update is then equal to the full gradient update.
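To make update (2) concrete, here is a minimal Python sketch of one sampled gradient step. It assumes we are handed a differentiable approximator `v_hat(s, w)` together with its gradient `grad_v_hat(s, w)` and some target standing in for $v_\pi(S)$; these helper names are illustrative, not from the lecture.

```python
import numpy as np

def sgd_step(w, s, target, v_hat, grad_v_hat, alpha=0.01):
    """One sampled gradient step of equation (2).

    w          : parameter vector (np.ndarray)
    s          : the sampled state
    target     : stand-in for v_pi(s), e.g. an MC return or a TD target
    v_hat      : callable (s, w) -> float, approximate value
    grad_v_hat : callable (s, w) -> np.ndarray, gradient of v_hat w.r.t. w
    """
    error = target - v_hat(s, w)                 # prediction error
    return w + alpha * error * grad_v_hat(s, w)  # Delta w from (2)
```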

Linear Function Approximation

Feature vectors: represent the state by a feature vector

$$\mathbf{x}(S) = \begin{pmatrix} \mathbf{x}_1(S) \\ \vdots \\ \mathbf{x}_n(S) \end{pmatrix}$$

Linear: represent the value function by a linear combination of features:

$$\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S)\,\mathbf{w}_j = \mathbf{x}(S)\cdot\mathbf{w} \qquad \text{(dot product)} \tag{3}$$

It is then easy to see that the gradient is just the feature vector:

$$\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)$$

Substituting into equation (2):

$$\Delta\mathbf{w} = \alpha\left(v_\pi(S) - \hat{v}(S,\mathbf{w})\right)\mathbf{x}(S)$$

Update = step-size × prediction error × feature value
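As a concrete illustration of this linear update rule, here is a minimal sketch; the feature map `features(s)` and the target are placeholders chosen for the example, not part of the lecture.

```python
import numpy as np

def linear_v_hat(s, w, features):
    """Linear value estimate: v_hat(S, w) = x(S) . w"""
    return features(s) @ w

def linear_sgd_step(w, s, target, features, alpha=0.01):
    """Update = step-size * prediction error * feature value."""
    x = features(s)               # x(S)
    error = target - x @ w        # target (stand-in for v_pi(S)) minus v_hat(S, w)
    return w + alpha * error * x  # Delta w = alpha * error * x(S)
```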


Note that table-lookup methods are a special case of linear value function approximation, with the feature vector

$$\mathbf{x}^{\text{table}}(S) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix}$$

Each element of $\mathbf{w}$ then represents the value of one state:

$$\hat{v}(S,\mathbf{w}) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix} \cdot \begin{pmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_n \end{pmatrix}$$
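A small sketch of this special case, assuming states are simply indexed $0,\dots,n-1$ (an assumption made for the example):

```python
import numpy as np

def one_hot_features(s, n_states):
    """Table-lookup features: the indicator vector with 1(S = s_i)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

# With these features, v_hat(s, w) = x(s) . w = w[s], so the linear SGD
# update reduces to the familiar tabular update of entry w[s] only.
```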

Incremental Prediction Algorithms

In RL there is no supervisor, i.e. no true value function $v_\pi(s)$. In practice, we substitute a target for $v_\pi(s)$:

  • For MC, the target is the return $G_t$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}G_t} - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$

  • For TD(0), the target is the TD target $R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w})$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w})} - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$

  • For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}G_t^\lambda} - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \tag{4}$$

    • Forward view linear TD($\lambda$):
      $$\begin{aligned} \Delta\mathbf{w} &= \alpha\left({\color{red}G_t^\lambda} - \hat{v}(S_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \\ &= \alpha\left({\color{red}G_t^\lambda} - \hat{v}(S_t,\mathbf{w})\right)\mathbf{x}(S_t) \end{aligned} \tag{4.1}$$

    • Backward view linear TD($\lambda$), using an eligibility trace $E_t$ (a code sketch follows this list):
      $$\begin{aligned} \delta_t &= R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w}) \\ E_t &= \gamma\lambda E_{t-1} + \mathbf{x}(S_t) \\ \Delta\mathbf{w} &= \alpha\,\delta_t E_t \end{aligned} \tag{4.2}$$
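The following minimal sketch implements one transition of backward-view linear TD($\lambda$) prediction as in equations (4.2); the feature map `features(s)` and the parameter values are illustrative assumptions.

```python
import numpy as np

def td_lambda_step(w, e, s, r, s_next, done, features,
                   alpha=0.01, gamma=0.99, lam=0.8):
    """One backward-view linear TD(lambda) update (equations 4.2).

    w, e : weight vector and eligibility trace, np.ndarrays of equal shape.
    Returns the updated (w, e); reset e to zeros at the start of each episode.
    """
    x = features(s)
    v = x @ w                                       # v_hat(S_t, w)
    v_next = 0.0 if done else features(s_next) @ w  # v_hat(S_{t+1}, w)
    delta = r + gamma * v_next - v                  # TD error delta_t
    e = gamma * lam * e + x                         # eligibility trace E_t
    w = w + alpha * delta * e                       # Delta w = alpha * delta_t * E_t
    return w, e
```

With $\lambda = 0$ the trace reduces to $\mathbf{x}(S_t)$ and this is exactly the semi-gradient TD(0) update above.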

Control with Value Function Approximation

(Figure from the slides: generalised policy iteration with approximate policy evaluation, $\hat{q}(\cdot,\cdot,\mathbf{w}) \approx q_\pi$, and $\epsilon$-greedy policy improvement.)

Action-Value Function Approximation

Analogous to state-value approximation, approximate the action-value function

$$\hat{q}(S,A,\mathbf{w}) \approx q_\pi(S,A)$$

by minimising the mean-squared error

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)^2\right]$$

and descending its gradient:

$$\Delta\mathbf{w} = -\frac{1}{2}\alpha\nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\,{\color{red}\mathbb{E}_\pi}\left[\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w})\right]$$

Linear Action-Value Function Approximation

As with linear state-value function approximation, represent the state-action pair by a feature vector

$$\mathbf{x}(S,A) = \begin{pmatrix} \mathbf{x}_1(S,A) \\ \vdots \\ \mathbf{x}_n(S,A) \end{pmatrix}$$

and represent the action-value function by a linear combination of features:

$$\hat{q}(S,A,\mathbf{w}) = \mathbf{x}(S,A)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S,A)\,\mathbf{w}_j = \mathbf{x}(S,A)\cdot\mathbf{w} \qquad \text{(dot product)}$$

The sampled update is then

$$\Delta\mathbf{w} = \alpha\left(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\right)\mathbf{x}(S,A)$$
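One common way to build $\mathbf{x}(S,A)$ from state features, shown here purely as an illustrative assumption rather than something prescribed by the lecture, is to stack one copy of the state features per discrete action:

```python
import numpy as np

def state_action_features(s, a, n_actions, state_features):
    """Place the state features x(S) in the block belonging to action a."""
    x_s = state_features(s)
    x = np.zeros(len(x_s) * n_actions)
    x[a * len(x_s):(a + 1) * len(x_s)] = x_s
    return x
```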

Incremental Control Algorithms

Substitute a target for $q_\pi(S,A)$:

  • For MC, the target is the return $G_t$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}G_t} - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w})$$

  • For TD(0), the target is the TD target $R_{t+1} + \gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w})$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}R_{t+1} + \gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w})} - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w})$$

  • For TD($\lambda$), the target is the action-value $\lambda$-return $q_t^\lambda$:
    $$\Delta\mathbf{w} = \alpha\left({\color{red}q_t^\lambda} - \hat{q}(S_t,A_t,\mathbf{w})\right)\nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w})$$

    • Backward view TD($\lambda$), with the eligibility trace accumulating the action-value gradient (a SARSA($\lambda$) sketch follows this list):
      $$\begin{aligned} \delta_t &= R_{t+1} + \gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w}) \\ E_t &= \gamma\lambda E_{t-1} + \nabla_{\mathbf{w}}\hat{q}(S_t,A_t,\mathbf{w}) \\ \Delta\mathbf{w} &= \alpha\,\delta_t E_t \end{aligned}$$
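Putting this together with $\epsilon$-greedy improvement gives linear SARSA($\lambda$). Below is a minimal sketch of one step; the feature map `features(s, a)`, the action list, and the $\epsilon$-greedy helper are illustrative placeholders rather than the lecture's exact algorithm.

```python
import numpy as np

def epsilon_greedy(w, s, actions, features, eps=0.1):
    """With probability eps pick a random action, else the greedy one under q_hat."""
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    q_values = [features(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_lambda_step(w, e, s, a, r, s_next, a_next, done, features,
                      alpha=0.01, gamma=0.99, lam=0.8):
    """One backward-view linear SARSA(lambda) update."""
    x = features(s, a)
    q = x @ w                                               # q_hat(S_t, A_t, w)
    q_next = 0.0 if done else features(s_next, a_next) @ w  # q_hat(S_{t+1}, A_{t+1}, w)
    delta = r + gamma * q_next - q                          # TD error delta_t
    e = gamma * lam * e + x                                 # trace E_t (grad of q_hat is x)
    w = w + alpha * delta * e                               # Delta w = alpha * delta_t * E_t
    return w, e
```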

Convergence of Prediction Algorithms

(Figures from the slides: tables summarising the convergence of prediction algorithms with table lookup, linear, and non-linear function approximation, on- and off-policy; roughly, MC converges in all cases, TD can diverge off-policy or with non-linear approximation, and Gradient TD converges in all cases.)

Exactly why plain TD may diverge while Gradient TD converges is not spelled out here, but the original slides explain it as follows:
TD does not follow the gradient of any objective function. This is why TD can diverge when off-policy or using non-linear function approximation. Gradient TD follows the true gradient of the projected Bellman error.

Convergence of Control Algorithms

(Figure from the slides: table summarising the convergence of control algorithms, covering MC control, SARSA, Q-learning, and Gradient Q-learning with table lookup, linear, and non-linear function approximation.)

Batch Methods

In plain terms, the incremental control algorithms above update once per observed transition, which is sample-inefficient, and consecutive updates are computed from strongly correlated data.

Batch methods instead store experience in a dataset and repeatedly sample a random batch from it, performing an SGD update on each batch (experience replay). The objective is still the squared error between the target value and the estimated value.

For concrete examples, see the earlier posts on DQN and DDQN.
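As a minimal sketch of this idea for linear value prediction (the buffer capacity, batch size, and feature map are illustrative choices, not from the lecture):

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores transitions and hands back random mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, r, s_next, done):
        self.buffer.append((s, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def batch_td_update(w, batch, features, alpha=0.01, gamma=0.99):
    """Average the sampled update (2), with a TD target, over a mini-batch."""
    grad = np.zeros_like(w)
    for s, r, s_next, done in batch:
        x = features(s)
        target = r + (0.0 if done else gamma * (features(s_next) @ w))
        grad += (target - x @ w) * x
    return w + alpha * grad / len(batch)
```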
