Value Function Approximation
Introduction
Why do we need it?
- So far we have represented the value function by a lookup table
  - Every state $s$ has an entry $V(s)$
  - Or every state-action pair $(s,a)$ has an entry $Q(s,a)$
- Problem with large MDPs:
  - There are too many states and/or actions to store in memory
  - It is too slow to learn the value of each state individually
Solution for large MDPs:
- Estimate value function with function approximation
$$\hat{v}(s,\mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)$$
Types of Value Function Approximation
Which Function Approximator?
There are many function approximators, e.g.:

- Linear combinations of features
- Neural networks
- Decision trees
- Nearest neighbour
- Fourier / wavelet bases
- …

We consider differentiable function approximators, in particular linear combinations of features and neural networks.
Incremental Methods
Value Function Approx. by SGD
Goal: find the parameter vector $\mathbf{w}$ minimising the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:

$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] \tag{1}$$
Gradient descent finds a local minimum:

$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \right]$$

The expected update is equal to the full gradient update.
Stochastic gradient descent samples the gradient:

$$\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \tag{2}$$
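As a concrete illustration of Eq. (2), here is a minimal Python sketch of a single SGD step; the callables `v_hat` and `grad_v_hat` are hypothetical names standing in for whatever differentiable approximator is chosen:

```python
def sgd_step(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One SGD step per Eq. (2): w <- w + alpha * (target - v_hat(s, w)) * grad_v_hat(s, w)."""
    error = v_target - v_hat(s, w)                 # prediction error for the sampled state
    return w + alpha * error * grad_v_hat(s, w)    # move w along the sampled gradient
```

With a linear approximator, `v_hat(s, w)` is `x(s) @ w` and `grad_v_hat(s, w)` is simply `x(s)`, which is exactly the linear case below.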
Linear Function Approximation
Feature vectors: represent a state by a feature vector

$$\mathbf{x}(S) = \begin{pmatrix} \mathbf{x}_1(S) \\ \vdots \\ \mathbf{x}_n(S) \end{pmatrix}$$
Linear: represent the value function by a linear combination of features:

$$\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S)\mathbf{w}_j = \mathbf{x}(S) \cdot \mathbf{w} \qquad \text{(dot product)} \tag{3}$$
It is easy to see that the gradient of a linear value function is just the feature vector:

$$\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)$$

Substituting into Eq. (2):

$$\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \mathbf{x}(S)$$

Update = step-size × prediction error × feature value
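As a minimal sketch (the feature vector and target below are made-up numbers for illustration), the linear update can be written directly:

```python
import numpy as np

def linear_update(w, x, v_target, alpha=0.1):
    """w <- w + alpha * (target - x^T w) * x, since the gradient of a linear v_hat is x(S)."""
    return w + alpha * (v_target - x @ w) * x

# toy usage with made-up numbers
w = np.zeros(3)
x_s = np.array([1.0, 0.5, -0.2])   # hypothetical feature vector x(S)
w = linear_update(w, x_s, v_target=2.0)
```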
Table lookup is in fact a special case of linear value function approximation, with feature vector

$$\mathbf{x}^{table}(S) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix}$$

Each element of $\mathbf{w}$ then stores the value of one state:

$$\hat{v}(S,\mathbf{w}) = \begin{pmatrix} \mathbf{1}(S=s_1) \\ \vdots \\ \mathbf{1}(S=s_n) \end{pmatrix} \cdot \begin{pmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_n \end{pmatrix}$$
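A quick sketch (with a hypothetical 4-state example) showing that one-hot features reduce linear approximation to table lookup:

```python
import numpy as np

def one_hot(state_index, n_states):
    """Table-lookup feature vector: x_table(S) is 1 at the index of S and 0 elsewhere."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

w = np.array([0.0, 1.5, -0.3, 2.0])   # made-up values; w_i is the stored value of state s_i
x = one_hot(2, n_states=4)
print(x @ w)                           # -0.3, exactly the table entry for that state
```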
Incremental Prediction Algorithms
In RL there is no supervisor, i.e. no true value function $v_\pi(s)$. In practice, we substitute a target for $v_\pi(s)$:

- For MC, the target is the return $G_t$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \tag{4}$$
  - Forward view linear TD($\lambda$):
    $$\begin{aligned} \Delta \mathbf{w} &= \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w}) \\ &= \alpha ({\color{red}G_t^\lambda} - \hat{v}(S_t, \mathbf{w})) \mathbf{x}(S_t) \end{aligned} \tag{4.1}$$
  - Backward view linear TD($\lambda$), with eligibility trace $E_t$ (a code sketch follows this list):
    $$\begin{aligned} \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\ E_t &= \gamma \lambda E_{t-1} + \mathbf{x}(S_t) \\ \Delta \mathbf{w} &= \alpha \delta_t E_t \end{aligned} \tag{4.2}$$
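Below is a minimal sketch of backward-view linear TD($\lambda$) prediction per Eq. (4.2); the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`), the fixed `policy`, and the `features` function are assumptions made for illustration:

```python
import numpy as np

def td_lambda_prediction(env, policy, features, n_features,
                         n_episodes=100, alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view linear TD(lambda): TD error delta_t, eligibility trace E_t, weight update."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)                   # eligibility trace E_t, reset each episode
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else x_next @ w   # no bootstrapping from terminal states
            delta = r + gamma * v_next - x @ w     # delta_t = R_{t+1} + gamma*v_hat(S_{t+1}) - v_hat(S_t)
            e = gamma * lam * e + x                # E_t = gamma * lambda * E_{t-1} + x(S_t)
            w += alpha * delta * e                 # Delta w = alpha * delta_t * E_t
            s = s_next
    return w
```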
Control with Value Function Approximation
Action-Value Function Approximation
Analogous to (state-)value function approximation, approximate the action-value function:
$$\hat{q}(S,A,\mathbf{w}) \approx q_\pi(S,A)$$
$$J(\mathbf{w}) = \mathbb{E}_\pi \left[ (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w}))^2 \right]$$

$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[ (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S,A, \mathbf{w}) \right]$$
Linear Action-Value Function Approximation
Analogous to [linear function approximation](#linear-function-approximation), represent each state-action pair by a feature vector
$$\mathbf{x}(S,A) = \begin{pmatrix} \mathbf{x}_1(S,A) \\ \vdots \\ \mathbf{x}_n(S,A) \end{pmatrix}$$
$$\hat{q}(S,A, \mathbf{w}) = \mathbf{x}(S,A)^{\text{T}}\mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S,A)\mathbf{w}_j = \mathbf{x}(S,A) \cdot \mathbf{w} \qquad \text{(dot product)}$$

$$\Delta \mathbf{w} = \alpha (q_\pi(S,A) - \hat{q}(S,A, \mathbf{w})) \mathbf{x}(S,A)$$
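One common way to construct $\mathbf{x}(S,A)$ from state features for a discrete action set is to give each action its own block of weights; this is an illustrative choice, not something the notes prescribe:

```python
import numpy as np

def sa_features(x_s, action, n_actions):
    """x(S, A): copy the state features into the block belonging to `action`, zeros elsewhere."""
    n = len(x_s)
    x_sa = np.zeros(n * n_actions)
    x_sa[action * n:(action + 1) * n] = x_s
    return x_sa

def q_hat(x_sa, w):
    """Linear action-value estimate: q_hat(S, A, w) = x(S, A)^T w."""
    return x_sa @ w
```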
Incremental Control Algorithms
Substitute a target for $q_\pi(S,A)$:
- For MC, the target is the return $G_t$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}G_t} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w})$ (a SARSA-style sketch with linear features follows this list):
  $$\Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w})} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $q_t^\lambda$:
  $$\Delta \mathbf{w} = \alpha ({\color{red}q_t^\lambda} - \hat{q}(S_t,A_t, \mathbf{w})) \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w})$$
  - Backward view TD($\lambda$):
    $$\begin{aligned} \delta_t &= R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \\ E_t &= \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}}\hat{q}(S_t,A_t, \mathbf{w}) \\ \Delta \mathbf{w} &= \alpha \delta_t E_t \end{aligned}$$
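Putting the pieces together, here is a rough sketch of SARSA(0)-style control with a linear $\hat{q}$ and an $\epsilon$-greedy policy; the environment interface and the feature constructors (`features_s`, `features_sa`) are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(w, x_s, n_actions, features_sa, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action under q_hat."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    q_values = [features_sa(x_s, a) @ w for a in range(n_actions)]
    return int(np.argmax(q_values))

def sarsa_linear(env, features_s, features_sa, n_weights, n_actions,
                 n_episodes=500, alpha=0.01, gamma=0.99, eps=0.1):
    """SARSA(0) with linear q_hat: w <- w + alpha * (R + gamma*q' - q) * x(S, A)."""
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, features_s(s), n_actions, features_sa, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x_sa = features_sa(features_s(s), a)
            if done:
                target = r                           # no bootstrapping at the end of an episode
            else:
                a_next = epsilon_greedy(w, features_s(s_next), n_actions, features_sa, eps)
                target = r + gamma * features_sa(features_s(s_next), a_next) @ w
            w += alpha * (target - x_sa @ w) * x_sa  # gradient of a linear q_hat is x(S, A)
            if not done:
                s, a = s_next, a_next
    return w
```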
Convergence of Prediction Algorithms

It is not entirely clear to me why plain TD can diverge while gradient TD converges, but the original slides explain it as follows:
TD does not follow the gradient of any objective function. This is why TD can diverge when off-policy or using non-linear function approximation. Gradient TD follows true gradient of projected Bellman error.
Convergence of Control Algorithms
Batch Methods
Put simply, the incremental control algorithms above update once per sample, which makes poor use of experience and produces strongly correlated updates.

Batch methods instead store experience in a buffer and repeatedly sample a random batch of data from it, performing SGD updates on each sampled batch. The objective is still the squared error between the target value and the estimated value. A sketch of such an experience-replay update follows.
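A minimal sketch of experience replay with minibatch SGD, using the same linear parameterisation as above; the buffer size, batch size, and the `(feature vector, target)` storage format are illustrative assumptions:

```python
import random
from collections import deque

import numpy as np

# the replay buffer stores experience as (feature vector x(S), target value) pairs
buffer = deque(maxlen=10_000)

def replay_update(w, buffer, batch_size=32, alpha=0.01):
    """Sample a random minibatch from the buffer and take one SGD step on the squared error."""
    if not buffer:
        return w
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    grad = np.zeros_like(w)
    for x, target in batch:
        grad += (target - x @ w) * x      # descent direction for 0.5 * (target - x^T w)^2
    return w + alpha * grad / len(batch)  # average the update over the minibatch
```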