Exercise 3.13 Give an equation for q π q_\pi qπ in terms of v π v_\pi vπ and the four-argument p p p.
First, we need to derive a formula from multiplication formula of probability theory:
p
(
x
∣
y
)
=
p
(
x
,
y
)
p
(
y
)
=
∑
z
p
(
x
,
y
,
z
)
p
(
y
)
=
∑
z
[
p
(
x
∣
y
,
z
)
⋅
p
(
z
∣
y
)
⋅
p
(
y
)
]
p
(
y
)
=
∑
z
[
p
(
x
∣
y
,
z
)
⋅
p
(
z
∣
y
)
]
(
1
)
\begin{aligned} p(x|y) &= \frac {p(x,y)}{p(y)} \\ &= \frac {\sum_z p(x,y,z)}{p(y)} \\ &= \frac {\sum_z \bigl [ p(x |y,z) \cdot p(z|y) \cdot p(y) \bigr ] } { p(y) } \\ &= \sum_z \bigl [ p(x|y,z) \cdot p(z|y) \bigr ] \qquad \qquad{(1)}\\ \end{aligned}
p(x∣y)=p(y)p(x,y)=p(y)∑zp(x,y,z)=p(y)∑z[p(x∣y,z)⋅p(z∣y)⋅p(y)]=z∑[p(x∣y,z)⋅p(z∣y)](1)
With formula (1), we can calculate
q
π
(
s
,
a
)
q_\pi(s,a)
qπ(s,a) as below:
q
π
(
s
,
a
)
=
E
π
(
G
t
∣
S
t
=
s
,
A
t
=
a
)
=
E
π
(
R
t
+
1
+
γ
G
t
+
1
∣
S
t
=
s
,
A
t
=
a
)
=
E
π
(
R
t
+
1
∣
S
t
=
s
,
A
t
=
a
)
+
γ
E
π
(
G
t
+
1
∣
S
t
=
s
,
A
t
=
a
)
=
∑
r
r
⋅
P
r
(
R
t
+
1
=
r
∣
S
t
=
s
,
A
t
=
a
)
+
γ
∑
g
t
+
1
g
t
+
1
⋅
P
r
(
G
t
+
1
=
g
t
+
1
∣
S
t
=
s
,
A
t
=
a
)
\begin{aligned} q_\pi(s,a)&= \mathbb E_\pi(G_t|S_t=s,A_t=a) \\ &=\mathbb E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a) \\ &= \mathbb E_\pi(R_{t+1} | S_t = s, A_t = a) + \gamma \mathbb E_\pi(G_{t+1}|S_t = s, A_t = a) \\ &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=g_{t+1}|S_t=s, A_t=a) \\ \end{aligned}
qπ(s,a)=Eπ(Gt∣St=s,At=a)=Eπ(Rt+1+γGt+1∣St=s,At=a)=Eπ(Rt+1∣St=s,At=a)+γEπ(Gt+1∣St=s,At=a)=r∑r⋅Pr(Rt+1=r∣St=s,At=a)+γgt+1∑gt+1⋅Pr(Gt+1=gt+1∣St=s,At=a)
Here, according to definition,
g
t
+
1
=
∑
k
=
0
∞
γ
k
⋅
r
t
+
2
+
k
g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}
gt+1=∑k=0∞γk⋅rt+2+k. And use formula (1), we can derive:
q
π
(
s
,
a
)
=
∑
r
r
⋅
P
r
(
R
t
+
1
=
r
∣
S
t
=
s
,
A
t
=
a
)
+
γ
∑
g
t
+
1
g
t
+
1
⋅
P
r
(
G
t
+
1
=
∑
k
=
0
∞
γ
k
⋅
r
t
+
2
+
k
∣
S
t
=
s
,
A
t
=
a
)
=
∑
r
r
⋅
∑
s
′
P
r
(
R
t
+
1
=
r
∣
S
t
=
s
,
A
t
=
a
,
S
t
+
1
=
s
′
)
⋅
P
r
(
S
t
+
1
=
s
′
∣
S
t
=
s
,
A
t
=
a
)
+
γ
∑
g
t
+
1
g
t
+
1
⋅
∑
s
′
P
r
(
G
t
+
1
=
∑
k
=
0
∞
γ
k
⋅
r
t
+
2
+
k
∣
S
t
=
s
,
A
t
=
a
,
S
t
+
1
=
s
′
)
⋅
P
r
(
S
t
+
1
=
s
′
∣
S
t
=
s
,
a
t
=
a
)
\begin{aligned} q_\pi(s,a) &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a) \\ &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a, S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, a_t = a) \\ \end{aligned}
qπ(s,a)=r∑r⋅Pr(Rt+1=r∣St=s,At=a)+γgt+1∑gt+1⋅Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St=s,At=a)=r∑r⋅s′∑Pr(Rt+1=r∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,At=a)+γgt+1∑gt+1⋅s′∑Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,at=a)
Because in Markov Process,
G
t
+
1
G_{t+1}
Gt+1 is the reward of status
S
t
+
1
=
s
′
S_{t+1} = s'
St+1=s′, the information of
S
t
=
s
S_t = s
St=s and
A
t
=
a
A_t = a
At=a are no effect on
G
t
+
1
G_{t+1}
Gt+1. So:
q
π
(
s
,
a
)
=
∑
r
r
⋅
∑
s
′
P
r
(
R
t
+
1
=
r
∣
S
t
=
s
,
A
t
=
a
,
S
t
+
1
=
s
′
)
⋅
P
r
(
S
t
+
1
=
s
′
∣
S
t
=
s
,
A
t
=
a
)
+
γ
∑
g
t
+
1
g
t
+
1
⋅
∑
s
′
P
r
(
G
t
+
1
=
∑
k
=
0
∞
γ
k
⋅
r
t
+
2
+
k
∣
S
t
+
1
=
s
′
)
⋅
P
r
(
S
t
+
1
=
s
′
∣
S
t
=
s
,
a
t
=
a
)
=
∑
s
′
{
P
r
(
S
t
+
1
=
s
′
∣
S
t
=
s
,
A
t
=
a
)
⋅
[
∑
r
r
⋅
P
r
(
R
t
+
1
=
r
∣
S
t
=
s
,
A
t
=
a
,
S
t
+
1
=
s
′
)
+
γ
∑
g
t
+
1
g
t
+
1
⋅
P
r
(
G
t
+
1
=
∑
k
=
0
∞
γ
k
⋅
r
t
+
2
+
k
∣
S
t
+
1
=
s
′
)
]
}
=
∑
s
′
{
p
(
s
′
∣
s
,
a
)
⋅
[
E
π
(
r
∣
s
,
a
,
s
′
)
+
γ
⋅
E
π
(
G
t
+
1
∣
S
t
+
1
=
s
′
)
]
}
=
∑
s
′
{
p
(
s
′
∣
s
,
a
)
⋅
[
E
π
(
r
∣
s
,
a
,
s
′
)
+
γ
⋅
v
π
(
s
′
)
]
}
\begin{aligned} q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, a_t = a) \\ &= \sum_{s'} \biggl \{ Pr(S_{t+1} = s' | S_t=s, A_t=a) \cdot \Bigl [ \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \Bigr ] \biggr \}\\ &=\sum_{s'} \biggl \{ p(s'| s,a) \cdot \Bigl [ \mathbb E_\pi(r|s,a,s') + \gamma \cdot \mathbb E_\pi(G_{t+1}|S_{t+1}=s') \Bigr ] \biggr \} \\ &=\sum_{s'} \biggl \{ p(s'| s,a) \cdot \Bigl [ \mathbb E_\pi(r|s,a,s') + \gamma \cdot v_\pi(s') \Bigr ] \biggr \} \\ \end{aligned}
qπ(s,a)=r∑r⋅s′∑Pr(Rt+1=r∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,At=a)+γgt+1∑gt+1⋅s′∑Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St+1=s′)⋅Pr(St+1=s′∣St=s,at=a)=s′∑{Pr(St+1=s′∣St=s,At=a)⋅[r∑r⋅Pr(Rt+1=r∣St=s,At=a,St+1=s′)+γgt+1∑gt+1⋅Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St+1=s′)]}=s′∑{p(s′∣s,a)⋅[Eπ(r∣s,a,s′)+γ⋅Eπ(Gt+1∣St+1=s′)]}=s′∑{p(s′∣s,a)⋅[Eπ(r∣s,a,s′)+γ⋅vπ(s′)]}
Denote
p
(
s
′
∣
a
,
s
)
=
P
s
,
s
′
a
p(s' | a , s ) = P_{s,s'}^a
p(s′∣a,s)=Ps,s′a and
E
π
(
r
∣
s
,
a
,
s
′
)
=
R
s
,
s
′
a
\mathbb E_\pi ( r | s, a, s' ) = R_{s,s'}^a
Eπ(r∣s,a,s′)=Rs,s′a, then
q
π
(
s
,
a
)
=
∑
s
′
{
P
s
,
s
′
a
⋅
[
R
s
,
s
′
a
+
γ
⋅
v
π
(
s
′
)
]
}
(
2
)
q_\pi(s,a) = \sum_{s'} \biggl \{ P_{s,s'}^a \cdot \Bigl [ R_{s,s'}^a + \gamma \cdot v_\pi(s') \Bigr ] \biggr \} \qquad{(2)}
qπ(s,a)=s′∑{Ps,s′a⋅[Rs,s′a+γ⋅vπ(s′)]}(2)
Here, the equation (2) is the result.