Simplify the robot vacuum to the following setup:
State set: $\{0,1,2,3,4,5\}$
Action set: $\{-1,+1\}$
Transition function:
$$\bar f(0,\pm 1)=0,\quad \bar f(1,+1)=2,\quad \bar f(1,-1)=0,\quad \bar f(2,+1)=3,\quad \bar f(2,-1)=1$$
$$\bar f(3,+1)=4,\quad \bar f(3,-1)=2,\quad \bar f(4,+1)=5,\quad \bar f(4,-1)=4,\quad \bar f(5,\pm 1)=5$$
Reward function:
$$\rho(0,\pm1,0)=0,\quad \rho(1,+1,2)=0,\quad \rho(1,-1,0)=1,\quad \rho(2,+1,3)=0,\quad \rho(2,-1,1)=0$$
$$\rho(3,+1,4)=0,\quad \rho(3,-1,2)=0,\quad \rho(4,+1,5)=5,\quad \rho(4,-1,3)=0,\quad \rho(5,\pm1,5)=0$$
1. Definitions
Definition 1.1 (argument of the maximum): given a function $f(x)$ with maximum value $M$, the set of $x$ values at which $f(x)$ attains $M$ is written $\arg\max_x f(x)$.
Definition 1.2 (expected value of a random variable): suppose the random variable $X$ takes the value $x_1$ with probability $p_1$, $x_2$ with probability $p_2$, and so on. Then the expectation of $X$ is
$$\mathbb E[X]=x_1p_1+x_2p_2+\dots+x_kp_k$$
$\mathbb E$ is called the expectation operator, and it has the following properties:
$$\mathbb E[X+Y]=\mathbb E[X]+\mathbb E[Y]$$
$$\mathbb E[X+c]=\mathbb E[X]+c$$
$$\mathbb E[cX]=c\,\mathbb E[X]$$
Definition 1.3 ($L^\infty$-norm): for a vector $\mathbf x=[x_1,x_2,\dots,x_n]^T$, the $L^\infty$-norm, written $\|\mathbf x\|_\infty$, is the largest element of $\mathbf x$ in absolute value:
$$\|\mathbf x\|_\infty \overset{\Delta}{=} \max_i|x_i|$$
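The three definitions translate directly into code. A small illustrative sketch (all function names are our own):

```python
# Illustrative implementations of Definitions 1.1-1.3.

def argmax_set(f, xs):
    """All x in xs at which f attains its maximum (Definition 1.1)."""
    m = max(f(x) for x in xs)
    return [x for x in xs if f(x) == m]

def expectation(values, probs):
    """E[X] = x_1*p_1 + ... + x_k*p_k (Definition 1.2)."""
    return sum(x * p for x, p in zip(values, probs))

def inf_norm(x):
    """||x||_inf = max_i |x_i| (Definition 1.3)."""
    return max(abs(xi) for xi in x)

print(argmax_set(lambda x: -(x - 2) ** 2, [0, 1, 2, 3]))  # [2]
print(expectation([1, 2, 3], [0.5, 0.25, 0.25]))          # 1.75
print(inf_norm([-4, 3, 1]))                               # 4
```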
2. Elements of a Markov Decision Process
2.1 States, Actions, Transitions, and Rewards
States:
$$S=\{s_1,s_2,\dots,s_{|S|}\}$$
Actions:
$$A(s)=\{a_1,a_2,\dots,a_{|A|}\}$$
Transition function:
Deterministic transition function:
$$\bar f : S\times A\rightarrow S,\qquad \bar f(s,a)=s'$$
Stochastic transition function:
$$f: S\times A\times S\rightarrow [0,1],\qquad f(s,a,s')=\mathbb P(S_{t+1}=s' \mid S_t=s, A_t=a)=p^a_{ss'}$$
Reward function:
Deterministic reward function:
$$\bar\rho: S\times A\times S\rightarrow \mathbb R$$
$$R_t=\bar\rho(S_{t-1},A_{t-1},S_t),\qquad \bar\rho(s,a,s')=r \ \text{ when } S_{t-1}=s,\ A_{t-1}=a,\ S_t=s'$$
State-determined reward function: $\rho(r\mid s)$
$$\rho:\mathbb R\times S\rightarrow [0,1]$$
$$r(s)=\mathbb E[R_t\mid S_{t-1}=s]=r_1\rho(r_1\mid s)+\dots+r_m\rho(r_m\mid s)=\sum_r r\,\rho(r\mid s)$$
State-and-action-determined reward function: $\rho(r\mid s,a)$
$$\rho:\mathbb R\times S\times A\rightarrow [0,1]$$
$$\begin{aligned} r(s,a)&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]\\ &=r_1\rho(r_1\mid s,a)+\dots+r_m\rho(r_m\mid s,a)\\ &=\sum_r r\,\rho(r\mid s,a) \end{aligned}$$
2.2 The Four-Argument $p$ Function
$$p(s',r\mid s,a)=f(s,a,s')\cdot \rho(r\mid s')$$
$$p:S\times \mathbb R\times S\times A\rightarrow [0,1]$$
From this we can derive:
The stochastic transition function:
$$f(s,a,s')=\mathbb P\{S_t=s'\mid S_{t-1}=s,A_{t-1}=a\}=\sum_{r\in R}p(s',r\mid s,a)$$
The state-determined reward function: since $p(s',r\mid s,a)=f(s,a,s')\cdot \rho(r\mid s')$ and $f(s,a,s')=\sum_{r\in R}p(s',r\mid s,a)$, we get
$$\rho(r\mid s')=\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)}$$
The state-and-action-determined reward function:
$$\rho(r\mid s,a)=\sum_{s'\in S}p(s',r\mid s,a)$$
The expected reward given the state:
$$r(s)=\mathbb E[R_t\mid S_t=s]=\sum_r r\,\rho(r\mid s)$$
The expected reward given state and action:
$$r(s,a)=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a]=\sum_{r\in R}\sum_{s'\in S}r\,p(s',r\mid s,a)$$
The expected reward given state, action, and next state:
$$\begin{aligned} r(s,a,s')&=\mathbb E[R_t\mid S_{t-1}=s,A_{t-1}=a,S_t=s']\\ &=\sum_r r\,\rho(r\mid s')\\ &=\sum_r r\,\frac{p(s',r\mid s,a)}{\sum_{r'\in R}p(s',r'\mid s,a)} \end{aligned}$$
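These identities can be checked numerically. Below is a sketch that recovers $f$, $\rho$, and $r(s,a)$ from a four-argument $p$ table; the tiny table itself is invented purely for illustration:

```python
# A toy four-argument p(s', r | s, a) table, stored as
# p[(s, a)][(s_next, r)] = probability. Values are illustrative only.
p = {
    ("s", "a"): {("s1", 0.0): 0.25, ("s1", 1.0): 0.25, ("s2", 2.0): 0.5},
}

def f(s, a, s_next):
    """f(s,a,s') = sum_r p(s', r | s, a): marginalize out the reward."""
    return sum(pr for (sn, r), pr in p[(s, a)].items() if sn == s_next)

def rho_next(s, a, s_next, r):
    """rho(r | s') = p(s', r | s, a) / sum_r' p(s', r' | s, a)."""
    return p[(s, a)].get((s_next, r), 0.0) / f(s, a, s_next)

def r_sa(s, a):
    """r(s,a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * pr for (sn, r), pr in p[(s, a)].items())

print(f("s", "a", "s1"))              # 0.5
print(rho_next("s", "a", "s1", 1.0))  # 0.5
print(r_sa("s", "a"))                 # 1.25
```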
2.3 Return
The discounted sum of rewards:
$$G_t\overset{\Delta}{=}R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots=\sum_{k=0}^\infty\gamma^k R_{t+k+1}$$
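A return can be computed for any finite reward sequence. For example, under the assumption $\gamma=0.5$, the run $1\to2\to3\to4\to5$ on the vacuum MDP above yields rewards $0,0,0,5$:

```python
# G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards along 1 -> 2 -> 3 -> 4 -> 5 on the vacuum MDP: 0, 0, 0, 5.
print(discounted_return([0, 0, 0, 5], gamma=0.5))  # 0.625
```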
2.4 Policy
$$\pi:S\times A\rightarrow[0,1],\qquad \pi(a\mid s)=\mathbb P\{A_t=a\mid S_t=s\}$$
3. Value Functions and the Bellman Equation
The state-value function:
$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^kR_{t+k+1}\,\Big|\,S_t=s\Big],\quad \text{for all } s\in S \end{aligned}$$
The action-value function:
$$\begin{aligned} q_\pi(s,a)&=\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi\Big[\sum_{k=0}^\infty\gamma^kR_{t+k+1}\,\Big|\,S_t=s,A_t=a\Big] \end{aligned}$$
The Bellman equation:
$$\begin{aligned} v_\pi(s)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[R_{t+2}+\gamma R_{t+3}+\dots\mid S_t=s]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s] \end{aligned}$$
$$\mathbb E_\pi[R_{t+1}\mid S_t=s]=\sum_a\pi(a\mid s)\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)$$
$$\begin{aligned} \mathbb E_\pi[G_{t+1}\mid S_t=s]&=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s']\\ &=\sum_a\pi(a\mid s)\sum_{s'}\sum_r p(s',r\mid s,a)\,v_\pi(s') \end{aligned}$$
$$\begin{aligned} v_\pi(s)&=\mathbb E_\pi[R_{t+1}\mid S_t=s]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s]\\ &=\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big] \end{aligned}$$
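The Bellman equation doubles as an update rule: iterating it from $v\equiv 0$ converges to $v_\pi$ (iterative policy evaluation). A sketch on the vacuum MDP with a uniform random policy; all names are our own:

```python
# Iterative policy evaluation: repeatedly apply
#   v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
# on the deterministic vacuum MDP (one (s', r) pair per (s, a)).
STATES = [0, 1, 2, 3, 4, 5]
ACTIONS = [-1, +1]
GAMMA = 0.9

def step(s, a):
    """Deterministic (s', r) pair for the vacuum MDP defined earlier."""
    if s in (0, 5):
        return s, 0
    s_next = 4 if (s, a) == (4, -1) else s + a
    r = {(1, -1): 1, (4, +1): 5}.get((s, a), 0)
    return s_next, r

def evaluate(pi, tol=1e-10):
    """pi(a, s) gives pi(a|s); sweeps until the max update is < tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = sum(pi(a, s) * (r + GAMMA * v[sn])
                      for a in ACTIONS
                      for sn, r in [step(s, a)])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

v = evaluate(lambda a, s: 0.5)  # uniform random policy
print(round(v[4], 3))  # 4.545
```

In-place (Gauss-Seidel) sweeps are used here; since $\gamma<1$ the update is a contraction, so the loop terminates.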
$$\begin{aligned} q_\pi(s,a)&\overset{\Delta}{=}\mathbb E_\pi[G_t\mid S_t=s,A_t=a]\\ &=\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a]\\ \mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]&=\sum_{s'}\sum_r r\cdot p(s',r\mid s,a)\\ \mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a]&=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,\mathbb E_\pi[G_{t+1}\mid S_{t+1}=s',A_{t+1}=a']\\ &=\sum_{s'}\sum_r p(s',r\mid s,a)\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a') \end{aligned}$$
Combining the two terms:
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\Big[r+\gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Big]$$
The $q_\pi$ Bellman equation:
$$\begin{aligned} q_\pi(s,a)&=\mathbb E_\pi[R_{t+1}\mid S_t=s,A_t=a]+\gamma\,\mathbb E_\pi[G_{t+1}\mid S_t=s,A_t=a]\\ &=\sum_{s',r}p(s',r\mid s,a)\Big[r+\gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Big] \end{aligned}$$
The relationship between $v_\pi(s)$ and $q_\pi(s,a)$:
$$v_\pi(s)=\sum_a\pi(a\mid s)\,q_\pi(s,a)$$
If the policy $\pi$ is deterministic, i.e. $\pi(a\mid s)=1$ for the single action $a=\pi(s)$, then
$$v_\pi(s)=q_\pi(s,\pi(s))$$
We can also derive
$$q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_\pi(s')\big]$$
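For a deterministic model the sum collapses to a single term, so $q_\pi$ is easy to read off from $v_\pi$. A minimal sketch on the vacuum MDP (names are our own):

```python
# q_pi(s,a) = r + gamma * v_pi(s') when the model is deterministic.
GAMMA = 0.9

def step(s, a):
    """Deterministic vacuum-MDP step (states 0 and 5 are absorbing)."""
    if s in (0, 5):
        return s, 0
    s_next = 4 if (s, a) == (4, -1) else s + a
    r = {(1, -1): 1, (4, +1): 5}.get((s, a), 0)
    return s_next, r

def q_from_v(s, a, v):
    s_next, r = step(s, a)
    return r + GAMMA * v[s_next]

v = {s: 0.0 for s in range(6)}  # e.g. the all-zero value function
print(q_from_v(4, +1, v))  # 5.0
```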
4. Optimal Policies and Optimal Value Functions
The optimal state-value and action-value functions:
$$v_*(s)\overset{\Delta}{=}\max_\pi v_\pi(s),\qquad q_*(s,a)\overset{\Delta}{=}\max_\pi q_\pi(s,a)$$
The Bellman optimality equations:
$$\begin{aligned} v_*(s)&=\max_{a\in A(s)}q_*(s,a)\\ &=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ q_*(s,a)&=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma v_*(s')\big]\\ &=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma \max_{a'}q_*(s',a')\big] \end{aligned}$$
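Iterating the $v_*$ optimality equation as an update rule gives value iteration. A sketch on the vacuum MDP, with a greedy policy extracted at the end (all names are our own):

```python
# Value iteration: v(s) <- max_a [r + gamma * v(s')] on the
# deterministic vacuum MDP, then read off the greedy policy.
STATES = [0, 1, 2, 3, 4, 5]
ACTIONS = [-1, +1]
GAMMA = 0.9

def step(s, a):
    if s in (0, 5):
        return s, 0
    s_next = 4 if (s, a) == (4, -1) else s + a
    r = {(1, -1): 1, (4, +1): 5}.get((s, a), 0)
    return s_next, r

def value_iteration(tol=1e-10):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(r + GAMMA * v[sn]
                       for a in ACTIONS for sn, r in [step(s, a)])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to v_*.
    policy = {s: max(ACTIONS, key=lambda a: step(s, a)[1]
                     + GAMMA * v[step(s, a)[0]])
              for s in STATES}
    return v, policy

v, policy = value_iteration()
print(policy[3], round(v[3], 3))  # 1 4.5
```

From state 3 the optimal move is $+1$ toward the charger, worth $\gamma\cdot 5=4.5$; states 1 through 4 all head right under this discount.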