The novelty of this chapter is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w}\in \mathbb{R}^{d}$.
What function approximation cannot do is augment the state representation with memories of past observations.
One difference from tabular methods:
When a single state is updated, the change generalizes from that state to affect the values of many other states. Such generalization makes the learning potentially more powerful but also potentially more difficult to manage and understand. In tabular methods, the learned values at each state were decoupled: an update at one state affected no other.
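The contrast can be sketched in a few lines of Python. This is only an illustration under assumptions not in the original notes: a linear approximator, a simple error-driven update, and two hypothetical feature sets (`tabular` with one-hot features, `shared` with overlapping features). Only state A is updated in both cases.

```python
def v_hat(x, w):
    """Linear value estimate: inner product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def update(w, x, target, alpha):
    """Shift v_hat(x, w) toward target and return the new weights."""
    error = target - v_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


def values_after_updating_A(features):
    """Start from zero weights, update state A once, report all state values."""
    w = [0.0] * len(features["A"])
    w = update(w, features["A"], target=1.0, alpha=0.5)
    return {s: v_hat(x, w) for s, x in features.items()}


# Hypothetical feature sets (illustrative only).
tabular = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}
shared = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0]}

tabular_values = values_after_updating_A(tabular)
shared_values = values_after_updating_A(shared)
print(tabular_values)  # only A's value moves
print(shared_values)   # B's value moves too, because it shares a feature with A
```

With one-hot features the update behaves like a table lookup; with shared features, the single update at A also changes the estimate for B.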
Update notation: $s \mapsto u$, where $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.
semi-gradient methods
Prediction Objective
A natural objective function is the Mean Squared Value Error, denoted $\overline{VE}$:
$$\overline{VE}(\mathbf{w}) \doteq \sum_{s\in S}\mu(s)\left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^{2}$$
where $\mu(s)$ is the state weighting (distribution), with $\mu(s)\geq 0$ and $\sum_{s\in S}\mu(s) = 1$.
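As a quick check of the definition, the sum can be evaluated by hand or in a few lines of Python. The three states and all numbers below are hypothetical, chosen only to make the arithmetic easy to follow; `mean_squared_value_error` is an illustrative helper, not part of any library.

```python
def mean_squared_value_error(mu, v_pi, v_hat):
    """Sum over states of mu[s] * (v_pi[s] - v_hat[s]) ** 2."""
    return sum(mu[s] * (v_pi[s] - v_hat[s]) ** 2 for s in mu)


# Hypothetical three-state example (illustrative numbers only).
mu = {"s1": 0.5, "s2": 0.25, "s3": 0.25}     # state weighting, sums to 1
v_pi = {"s1": 1.0, "s2": 2.0, "s3": 4.0}     # true values under the policy
v_hat = {"s1": 1.0, "s2": 1.0, "s3": 2.0}    # approximate values

ve = mean_squared_value_error(mu, v_pi, v_hat)
print(ve)  # 0.5*0 + 0.25*1 + 0.25*4 = 1.25
```

Errors in states with larger $\mu(s)$ count more, which is why the weighting matters when no weight vector can fit every state exactly.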
Stochastic-gradient and Semi-gradient Methods
In particular, there is generally no $\mathbf{w}$ that gets all the states, or even all the examples, exactly correct. In addition, we must generalize to all the other states that have not appeared in examples.
Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example (we assume that states appear in examples with the same distribution $\mu$):
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \frac{1}{2}\alpha\nabla\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right]^2 = \mathbf{w}_t + \alpha\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right]\nabla\hat{v}(S_t, \mathbf{w}_t)$$
where
$\alpha$: a positive step-size parameter
$\nabla f(\mathbf{w})$: the vector of partial derivatives of any scalar expression $f(\mathbf{w})$ with respect to the components of the weight vector
Gradient descent methods are called “stochastic” when the update is done, as here, on only a single example, which might have been selected stochastically. Over many examples, taking small steps, the overall effect is to minimize an average performance measure such as $\overline{VE}$.
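The update rule can be sketched as follows. This is a minimal illustration, not a full algorithm: it assumes the true values $v_\pi(s)$ are available as targets (in practice they are not), and the state space, the function `v_pi`, the features `(1, s)`, and the step size are all hypothetical choices.

```python
import random


def sgd_step(w, s, target, v_hat, grad_v_hat, alpha):
    """One SGD update: w + alpha * (target - v_hat(s, w)) * grad v_hat(s, w)."""
    error = target - v_hat(s, w)
    return [wi + alpha * error * gi for wi, gi in zip(w, grad_v_hat(s, w))]


# Hypothetical setup: states 0..9 with true values v_pi(s) = 2*s + 1,
# approximated by a linear function of the features (1, s).
def features(s):
    return [1.0, float(s)]


def v_hat(s, w):
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def grad_v_hat(s, w):
    # For a linear approximator the gradient is the feature vector itself.
    return features(s)


def v_pi(s):
    return 2.0 * s + 1.0


rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(20000):
    s = rng.randrange(10)  # one example per step, sampled with mu(s) = 1/10
    w = sgd_step(w, s, v_pi(s), v_hat, grad_v_hat, alpha=0.01)

print(w)  # should end up close to [1.0, 2.0]
```

Each step uses a single sampled state, yet the many small steps together drive the weights toward the values that minimize the average error under the sampling distribution.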
Linear Methods
Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$:
$$\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^{\top}\mathbf{x}(s) \doteq \sum_{i=1}^{d}w_{i}x_{i}(s)$$
where $\mathbf{x}(s)\doteq(x_1(s), x_2(s), \ldots, x_d(s))^{\top}$ is called the feature vector representing state $s$.
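A short worked derivation shows why the linear case is convenient for SGD. Differentiating the sum with respect to each weight component gives the gradient, and substituting it into the general SGD update (with $v_\pi(S_t)$ as the target, as in the update rule earlier in these notes) gives the linear form:

```latex
\nabla \hat{v}(s, \mathbf{w})
  = \left(\frac{\partial}{\partial w_1}\sum_{i=1}^{d} w_i x_i(s),\ \ldots,\
          \frac{\partial}{\partial w_d}\sum_{i=1}^{d} w_i x_i(s)\right)^{\top}
  = (x_1(s), \ldots, x_d(s))^{\top}
  = \mathbf{x}(s)

\mathbf{w}_{t+1}
  = \mathbf{w}_t + \alpha\left[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\right]\mathbf{x}(S_t)
```

So in the linear case each weight moves in proportion to the prediction error and to the value of its own feature in the current state.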
Feature Construction for Linear Methods
Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.
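Compared with nonlinear approximation, the advantage of linear approximation is that there is only one optimum (or, in degenerate cases, one set of equally good optima), so any method guaranteed to converge to a local optimum is guaranteed to converge to the global optimum.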