Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
By definition:

$$
\begin{aligned}
Q_\pi(s,a) &= \mathbb E_\pi ( G_t \mid S_t=s, A_t=a ) \\
&= \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+1} \,\Big|\, S_t=s, A_t=a \Bigr) \\
&= \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+1} \,\Big|\, S_t=s, A_t=a, S_{t+1}=s' \Bigr) P( S_{t+1}=s' \mid A_t=a, S_t=s ) \Bigr] \\
&= \sum_{s'} \Bigl\{ \Bigl[ \mathbb E_\pi ( R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' ) + \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^k R_{t+1+k} \,\Big|\, S_t=s, A_t=a, S_{t+1}=s' \Bigr) \Bigr] P( S_{t+1}=s' \mid A_t=a, S_t=s ) \Bigr\}
\end{aligned}
$$
Denote

$$
P( S_{t+1}=s' \mid A_t=a, S_t=s ) = P_{s,s'}^a
$$

$$
\mathbb E_\pi ( R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' ) = R_{s,s'}^a
$$
then:

$$
\begin{aligned}
Q_\pi(s,a) &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^k R_{t+1+k} \,\Big|\, S_t=s, A_t=a, S_{t+1}=s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} \,\Big|\, S_t=s, A_t=a, S_{t+1}=s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \,\Big|\, S_t=s, A_t=a, S_{t+1}=s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \,\Big|\, S_{t+1}=s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl\{ \sum_{a'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \,\Big|\, S_{t+1}=s', A_{t+1}=a' \Bigr) P( A_{t+1}=a' \mid S_{t+1}=s' ) \Bigr] P_{s,s'}^a \Bigr\}
\end{aligned}
$$

The fourth line uses the Markov property: given $S_{t+1}=s'$, the rewards from $R_{t+2}$ onward do not depend on $S_t$ or $A_t$. The fifth line conditions on the next action $A_{t+1}$ by the law of total expectation.
By definition,

$$
\mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \,\Big|\, S_{t+1}=s', A_{t+1}=a' \Bigr) = Q_\pi(s',a')
$$

$$
P( A_{t+1}=a' \mid S_{t+1}=s' ) = \pi(s',a')
$$
so

$$
Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a') \pi(s',a') \Bigr] P_{s,s'}^a
$$
This is the Bellman equation for action values.
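As a sanity check, the final equation can be tested numerically. The sketch below is not part of the exercise. It assumes a made-up MDP with two states and two actions, with the transition probabilities `P`, expected rewards `R`, policy `PI`, and discount `GAMMA` all invented for illustration. It applies the right-hand side of the equation repeatedly until the values stop changing, then confirms that the result satisfies the equation up to floating-point error.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a][s2] = P(S_{t+1} = s2 | S_t = s, A_t = a); each inner row sums to 1.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# R[s][a][s2] = E(R_{t+1} | S_t = s, A_t = a, S_{t+1} = s2)
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 1.0], [0.0, 0.5]],
]
# PI[s][a] = pi(s, a), the probability of taking action a in state s.
PI = [[0.5, 0.5], [0.3, 0.7]]


def bellman_backup(q):
    """Apply the right-hand side of the action-value Bellman equation once."""
    new_q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for s in STATES:
        for a in ACTIONS:
            total = 0.0
            for s2 in STATES:
                # Inner sum over a': expected action value of the successor under pi.
                v_next = sum(PI[s2][a2] * q[s2][a2] for a2 in ACTIONS)
                total += P[s][a][s2] * (R[s][a][s2] + GAMMA * v_next)
            new_q[s][a] = total
    return new_q


def max_residual(q):
    """Largest absolute gap between q and its one-step backup."""
    backed_up = bellman_backup(q)
    return max(abs(backed_up[s][a] - q[s][a]) for s in STATES for a in ACTIONS)


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Iterate the backup from zero until successive iterates differ by less than tol."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_iters):
        new_q = bellman_backup(q)
        diff = max(abs(new_q[s][a] - q[s][a]) for s in STATES for a in ACTIONS)
        q = new_q
        if diff < tol:
            break
    return q


q_pi = evaluate_policy()
print("q_pi:", q_pi)
print("Bellman residual:", max_residual(q_pi))
```

The iteration converges because the backup is a contraction for a discount factor below 1. The printed residual should therefore be close to zero for this toy MDP.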