The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me. Here I work through both lines step by step.
First, expand the expectation under $\pi'$:

$$
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s,a)
$$

Because $\pi'$ is the $\epsilon$-greedy policy with respect to $q_\pi$, with greedy action $A^* = \arg\max_a q_\pi(s,a)$, we have

$$
\pi'(a \mid s) =
\begin{cases}
1 - \epsilon + \epsilon / |\mathcal A(s)| & \text{if } a = A^* \\
\epsilon / |\mathcal A(s)| & \text{if } a \neq A^*
\end{cases}
$$

Therefore

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \neq A^*} \frac{\epsilon}{|\mathcal A(s)|}\, q_\pi(s,a) + \Bigl(1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|}\Bigr) q_\pi(s, a = A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a \neq A^*} q_\pi(s,a) + \frac{\epsilon}{|\mathcal A(s)|}\, q_\pi(s, a = A^*) + (1-\epsilon)\, q_\pi(s, a = A^*) \\
&= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$

This is the second line of formula (5.2).
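The identity between the first and second lines of (5.2) can be checked numerically. The following is a minimal sketch, not taken from the book: the action values in `q` are made-up numbers standing in for $q_\pi(s,a)$, and `epsilon_greedy_probs` is a hypothetical helper that builds $\pi'(a \mid s)$ as defined above.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return pi'(a|s) for an epsilon-greedy policy over the given action values."""
    n = len(q_values)
    # A*: the greedy action (ties broken by the lowest index, an assumption).
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


random.seed(0)
epsilon = 0.1
# Hypothetical values of q_pi(s, a) for a state with five actions.
q = [random.uniform(-1.0, 1.0) for _ in range(5)]

pi_prime = epsilon_greedy_probs(q, epsilon)

# First line of (5.2): expectation of q_pi under pi'.
first_line = sum(p * v for p, v in zip(pi_prime, q))
# Second line of (5.2): uniform part plus greedy part.
second_line = epsilon / len(q) * sum(q) + (1.0 - epsilon) * max(q)

print(abs(first_line - second_line) < 1e-12)
```

The two quantities agree up to floating-point rounding for any choice of `q` and any `epsilon` in $(0, 1]$, since the derivation only uses the definition of $\pi'$.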
Consider the quantity $x$, defined as

$$
x = \sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s,a)
$$
Because $\pi$ is an $\epsilon$-soft policy, $\pi(a \mid s) \geq \epsilon / |\mathcal A(s)|$ for every action $a$, so every bracketed coefficient in $x$ is nonnegative. The coefficients sum to

$$
\sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] = 1 - \epsilon
$$

so $x$ is $(1-\epsilon)$ times a weighted average of the values $q_\pi(s,a)$, and a weighted average can never exceed the maximum:

$$
x \leq (1-\epsilon) \max_a q_\pi(s,a)
$$

In the special case where $\pi$ is itself $\epsilon$-greedy with greedy action $A^*$, we have $\pi(a \mid s) = \epsilon / |\mathcal A(s)|$ when $a \neq A^*$, so only the $a = A^*$ term survives and the bound holds with equality:

$$
\begin{aligned}
x &= \Bigl[ \pi(a = A^* \mid s) - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, a = A^*) \\
&= \Bigl[ 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|} - \frac{\epsilon}{|\mathcal A(s)|} \Bigr] q_\pi(s, a = A^*) \\
&= (1 - \epsilon)\, q_\pi(s, a = A^*) \\
&= (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$
Also, factoring out $(1-\epsilon)$,

$$
x = (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon}\, q_\pi(s,a)
$$
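Writing $x$ as $(1-\epsilon)$ times a sum is useful because, for an $\epsilon$-soft $\pi$, the fractions inside the sum are nonnegative and add up to 1, so the sum is a weighted average of the $q_\pi(s,a)$. Here is a small numerical sketch of that fact. The helper `random_epsilon_soft_policy` and all numbers are illustrative assumptions, not part of the book.

```python
import random


def random_epsilon_soft_policy(n_actions, epsilon, rng):
    """Sample a policy with pi(a|s) >= epsilon / n_actions for every action."""
    raw = [rng.random() for _ in range(n_actions)]
    total = sum(raw)
    # Mix an arbitrary distribution with the uniform one to guarantee softness.
    return [epsilon / n_actions + (1.0 - epsilon) * r / total for r in raw]


rng = random.Random(1)
epsilon = 0.2
n_actions = 4
# Hypothetical values of q_pi(s, a).
q = [rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
pi = random_epsilon_soft_policy(n_actions, epsilon, rng)

# The weights that multiply q_pi(s, a) inside the sum.
weights = [(p - epsilon / n_actions) / (1.0 - epsilon) for p in pi]
x = (1.0 - epsilon) * sum(w * v for w, v in zip(weights, q))

print(min(weights) >= -1e-12)                      # weights are nonnegative
print(abs(sum(weights) - 1.0) < 1e-9)              # weights sum to one
print(x <= (1.0 - epsilon) * max(q) + 1e-12)       # weighted average <= maximum
```

The small tolerances only absorb floating-point rounding; the three properties themselves hold exactly for any $\epsilon$-soft policy with $\epsilon < 1$.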
Since $x \leq (1-\epsilon)\max_a q_\pi(s,a)$, replacing the maximum term with $x$ can only lower the value. Therefore

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\
&\geq \frac{\epsilon}{|\mathcal A(s)|} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal A(s)|}}{1 - \epsilon}\, q_\pi(s,a)
\end{aligned}
$$
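As a sanity check on the inequality, the sketch below evaluates both sides for one randomly chosen $\epsilon$-soft policy. It also evaluates $\sum_a \pi(a \mid s)\, q_\pi(s,a)$, which is what the right-hand side reduces to once the two $\frac{\epsilon}{|\mathcal A(s)|}\sum_a q_\pi(s,a)$ terms cancel. All values are made-up stand-ins for $q_\pi(s,a)$ and $\pi(a \mid s)$, chosen only for illustration.

```python
import random

rng = random.Random(2)
epsilon = 0.3
n_actions = 6
# Hypothetical values of q_pi(s, a).
q = [rng.uniform(-2.0, 2.0) for _ in range(n_actions)]

# An arbitrary epsilon-soft policy pi: uniform floor plus a random distribution.
raw = [rng.random() for _ in range(n_actions)]
raw_total = sum(raw)
pi = [epsilon / n_actions + (1.0 - epsilon) * r / raw_total for r in raw]

uniform_part = epsilon / n_actions * sum(q)

# Second line of (5.2): value of acting epsilon-greedily for one step.
second_line = uniform_part + (1.0 - epsilon) * max(q)

# Third line of (5.2): the maximum replaced by the weighted sum.
third_line = uniform_part + (1.0 - epsilon) * sum(
    (p - epsilon / n_actions) / (1.0 - epsilon) * v for p, v in zip(pi, q)
)

# After the uniform terms cancel, only the expectation of q_pi under pi remains.
expectation_under_pi = sum(p * v for p, v in zip(pi, q))

print(second_line >= third_line - 1e-12)
print(abs(third_line - expectation_under_pi) < 1e-12)
```

Changing the seed, `epsilon`, or `n_actions` gives other instances of the same inequality; the check does not depend on the particular numbers used here.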
This is the third line of formula (5.2). Both lines should now be easy to follow.
Reinforcement Learning--Explanation to Formula (5.2)