In Chapter 3 of the book *Reinforcement Learning: An Introduction*, the Bellman equation for $v_\pi$ is given as equation (3.14), but without a detailed derivation. I found that confusing and unsatisfying, so I tried to derive the Bellman equation myself. The details of the derivation are given below:
$$
\begin{aligned}
v_\pi(s) &= \mathbb E_\pi (G_t \mid S_t = s) \\
&= \mathbb E_\pi(R_{t+1} + \gamma \cdot G_{t+1} \mid S_t = s) \\
&= \mathbb E_\pi(R_{t+1} \mid S_t = s) + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_t = s) \\
&= \sum_a \bigl[ \mathbb E_\pi (R_{t+1} \mid S_t = s, A_t = a) \cdot Pr(A_t = a \mid S_t = s) \\
&\quad + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_t = s, A_t = a) \cdot Pr(A_t = a \mid S_t = s) \bigr] \\
&= \sum_a Pr(A_t = a \mid S_t = s) \bigl[ \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a) + \gamma \cdot \mathbb E_\pi (G_{t+1} \mid S_t = s, A_t = a) \bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r r \cdot Pr(R_{t+1} = r \mid S_t = s, A_t = a) + \gamma \sum_g g \cdot Pr(G_{t+1} = g \mid S_t = s, A_t = a) \Bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r \sum_{s'} r \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} Pr(G_{t+1} = g, R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr] \\
&= \sum_a \pi(a \mid s) \Bigl[ \sum_r \sum_{s'} r \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \frac{Pr(G_{t+1} = g, R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{Pr(S_t = s, A_t = a)} \Bigr] \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} r \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \Bigl[ Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \\
&\quad \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \, Pr(S_t = s, A_t = a) / Pr(S_t = s, A_t = a) \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} r \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \cdot \sum_g g \sum_r \sum_{s'} \Bigl[ Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \\
&\quad \cdot Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \sum_g g \cdot Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) \Bigr] \biggr\}
\end{aligned}
$$
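The derivation above appears to rest on three standard identities for discrete random variables. As a sketch, with $X$, $Y$, $Z$ as generic placeholder variables (they are not symbols from the book) and assuming every conditioning event has positive probability:

$$
\begin{aligned}
\mathbb E(X \mid Z = z) &= \sum_y \mathbb E(X \mid Y = y, Z = z) \cdot Pr(Y = y \mid Z = z) && \text{(law of total expectation)} \\
Pr(X = x \mid Z = z) &= \sum_y Pr(X = x, Y = y \mid Z = z) && \text{(marginalization)} \\
Pr(X = x, Y = y \mid Z = z) &= Pr(X = x \mid Y = y, Z = z) \cdot Pr(Y = y \mid Z = z) && \text{(chain rule)}
\end{aligned}
$$

The first identity, with $Y = A_t$ and $Z = S_t$, introduces the sum over actions. The second, with $Y = (R_{t+1}, S_{t+1})$ and $Z = (S_t, A_t)$, introduces the sums over $r$ and $s'$. The third splits the joint probability of $G_{t+1}$, $R_{t+1}$ and $S_{t+1}$ into the two factors that appear in the last lines.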
$\because$ In a Markov decision process, once $S_{t+1}$ is given, $G_{t+1}$ depends only on $S_{t+1}$ (and on the policy $\pi$); $R_{t+1}$, $S_t$ and $A_t$ carry no further information about $G_{t+1}$,

$$
\therefore Pr(G_{t+1} = g \mid R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a) = Pr(G_{t+1} = g \mid S_{t+1} = s')
$$
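The Markov step can be checked empirically. The following is a minimal Monte Carlo sketch, not part of the original derivation: the two-state MDP, its probabilities and rewards, the policy `PI`, and the truncation `HORIZON` are all hypothetical choices made for illustration. It estimates $\mathbb E_\pi(G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s')$ for every $(s, a)$ and shows that, for a fixed $s'$, the estimates agree up to sampling noise.

```python
import random
from collections import defaultdict

GAMMA = 0.9
HORIZON = 50  # truncation length (assumption); GAMMA**50 is about 0.005

# Hypothetical dynamics p(r, s' | s, a): lists of (probability, reward, next_state).
P = {
    ("s0", "stay"): [(0.5, 0.0, "s0"), (0.5, 1.0, "s1")],
    ("s0", "move"): [(0.3, 0.0, "s0"), (0.7, 1.0, "s1")],
    ("s1", "stay"): [(0.4, 0.0, "s0"), (0.6, 2.0, "s1")],
    ("s1", "move"): [(0.5, 0.0, "s0"), (0.5, 2.0, "s1")],
}
# Hypothetical policy pi(a | s).
PI = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 0.5, "move": 0.5}}


def step(rng, s, a):
    """Sample (r, s') from p(., . | s, a)."""
    u, acc = rng.random(), 0.0
    for prob, r, s2 in P[(s, a)]:
        acc += prob
        if u < acc:
            return r, s2
    return r, s2  # guard against floating-point round-off


def sample_action(rng, s):
    """Sample an action from pi(. | s)."""
    return "stay" if rng.random() < PI[s]["stay"] else "move"


def truncated_return(rng, s):
    """Sample a return (truncated at HORIZON steps) starting in state s under PI."""
    g, discount = 0.0, 1.0
    for _ in range(HORIZON):
        r, s = step(rng, s, sample_action(rng, s))
        g += discount * r
        discount *= GAMMA
    return g


def conditional_means(n_episodes=16000, seed=0):
    """Estimate E[G_{t+1} | S_t = s, A_t = a, S_{t+1} = s'] for every triple."""
    rng = random.Random(seed)
    sums, counts = defaultdict(float), defaultdict(int)
    pairs = list(P)
    for i in range(n_episodes):
        s, a = pairs[i % len(pairs)]        # cycle through the (S_t, A_t) pairs
        _, s2 = step(rng, s, a)             # sample S_{t+1}
        sums[(s, a, s2)] += truncated_return(rng, s2)  # one sample of G_{t+1}
        counts[(s, a, s2)] += 1
    return {k: sums[k] / counts[k] for k in sums}


means = conditional_means()
for (s, a, s2), m in sorted(means.items(), key=lambda kv: kv[0][2]):
    print(f"E[G_t+1 | S_t={s}, A_t={a}, S_t+1={s2}] ~ {m:.2f}")
```

Rows that share the same `S_t+1` should print nearly identical estimates regardless of `S_t` and `A_t`, which is what the equality above claims. The simulator is itself Markov by construction, so this illustrates the identity rather than proving it.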
$$
\begin{aligned}
\therefore v_\pi(s) &= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \sum_g g \cdot Pr(G_{t+1} = g \mid S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad \cdot \Bigl[ r + \gamma \, \mathbb E_\pi(G_{t+1} \mid S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \biggl\{ \sum_r \sum_{s'} p(r, s' \mid s, a) \cdot \Bigl[ r + \gamma v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
That is the Bellman equation for $v_\pi$, which completes the derivation.
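To see the final equation at work, here is a minimal sketch that treats its right-hand side as an update rule and iterates it until the values stop changing (iterative policy evaluation, which is a swapped-in technique and not part of the derivation). The two-state MDP, the policy `PI` and all numbers are hypothetical, chosen only for illustration.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# Hypothetical dynamics p(r, s' | s, a): lists of (probability, reward, next_state).
P = {
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s0", "move"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
    ("s1", "move"): [(0.9, 0.0, "s0"), (0.1, 2.0, "s1")],
}
# Hypothetical policy pi(a | s).
PI = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.7, "move": 0.3},
}


def bellman_backup(v, s):
    """Right-hand side: sum_a pi(a|s) sum_{r,s'} p(r,s'|s,a) [r + gamma v(s')]."""
    return sum(
        PI[s][a] * sum(prob * (r + GAMMA * v[s2]) for prob, r, s2 in P[(s, a)])
        for a in ACTIONS
    )


def evaluate_policy(tol=1e-12, max_iters=100_000):
    """Apply the backup to all states repeatedly until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: bellman_backup(v, s) for s in STATES}
        delta = max(abs(new_v[s] - v[s]) for s in STATES)
        v = new_v
        if delta < tol:
            break
    return v


v_pi = evaluate_policy()
# At the fixed point, v_pi(s) equals its own Bellman backup for every state.
residual = max(abs(v_pi[s] - bellman_backup(v_pi, s)) for s in STATES)
print(v_pi)
print("largest Bellman residual:", residual)
```

The printed residual should be close to zero, meaning the computed values satisfy the Bellman equation at every state. Because $\gamma < 1$, the backup is a contraction, so the iteration converges from any starting values.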
The derivation of Bellman equation for value of a policy