【Example: why the return matters】
Question: starting from $s_1$, can we use a mathematical tool to decide which policy is best?
Answer: the return can be used to evaluate a policy.
- Policy 1:
$$\begin{aligned} \text{return}_1 &= 0+\gamma\cdot 1+\gamma^2\cdot 1+\ldots \\ &= \gamma\left(1+\gamma+\gamma^2+\ldots\right) \\ &= \frac{\gamma}{1-\gamma} \end{aligned}$$
- Policy 2:
$$\begin{aligned} \text{return}_2 &= -1+\gamma\cdot 1+\gamma^2\cdot 1+\ldots \\ &= -1+\gamma\left(1+\gamma+\gamma^2+\ldots\right) \\ &= -1+\frac{\gamma}{1-\gamma} \end{aligned}$$
- Policy 3:
$$\begin{aligned} \text{return}_3 &= 0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right) \\ &= -0.5+\frac{\gamma}{1-\gamma} \end{aligned}$$
Hence $\text{return}_1>\text{return}_3>\text{return}_2$, so policy 1 is the best and policy 2 is the worst.
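The three closed-form returns can be checked numerically. A minimal sketch, assuming a discount factor $\gamma=0.9$ purely for illustration:

```python
# Closed-form returns of the three policies, as functions of gamma (0 < gamma < 1).
def return1(gamma):
    # 0 + gamma*1 + gamma^2*1 + ... = gamma / (1 - gamma)
    return gamma / (1 - gamma)

def return2(gamma):
    # -1 + gamma*1 + gamma^2*1 + ... = -1 + gamma / (1 - gamma)
    return -1 + gamma / (1 - gamma)

def return3(gamma):
    # 50/50 mixture of the two branches: -0.5 + gamma / (1 - gamma)
    return 0.5 * return2(gamma) + 0.5 * return1(gamma)

gamma = 0.9  # assumed discount factor for illustration
print(return1(gamma), return3(gamma), return2(gamma))  # approximately 9.0, 8.5, 8.0
```

For any $0<\gamma<1$ the ordering $\text{return}_1>\text{return}_3>\text{return}_2$ holds, since the three expressions differ only by the constants $0$, $-0.5$, $-1$.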
✨Computing the return
Method 1 (by definition):
Let $v_i$ denote the return obtained starting from $s_i\ (i=1,2,3,4)$:
$$\begin{aligned} & v_1=r_1+\gamma r_2+\gamma^2 r_3+\ldots \\ & v_2=r_2+\gamma r_3+\gamma^2 r_4+\ldots \\ & v_3=r_3+\gamma r_4+\gamma^2 r_1+\ldots \\ & v_4=r_4+\gamma r_1+\gamma^2 r_2+\ldots \end{aligned}$$
Method 2 (bootstrapping): the returns obtained starting from different states depend on the returns obtained starting from other states:
$$\begin{aligned} & v_1=r_1+\gamma\left(r_2+\gamma r_3+\ldots\right)=r_1+\gamma v_2 \\ & v_2=r_2+\gamma\left(r_3+\gamma r_4+\ldots\right)=r_2+\gamma v_3 \\ & v_3=r_3+\gamma\left(r_4+\gamma r_1+\ldots\right)=r_3+\gamma v_4 \\ & v_4=r_4+\gamma\left(r_1+\gamma r_2+\ldots\right)=r_4+\gamma v_1 \end{aligned}$$
In matrix form:
$$\underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{\mathbf{v}}=\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}+\begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix}=\underbrace{\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}}_{\mathbf{r}}+\gamma \underbrace{\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}}_{\mathbf{P}} \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{\mathbf{v}}$$
$$\mathbf{v}=\mathbf{r}+\gamma \mathbf{P} \mathbf{v}$$
- This equation is the Bellman equation (for this particular problem).
- It says that the value of one state depends on the values of other states.
- The matrix-vector form lets us solve for all the values at once.
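The matrix equation $\mathbf{v}=\mathbf{r}+\gamma\mathbf{P}\mathbf{v}$ is a linear system $(I-\gamma\mathbf{P})\mathbf{v}=\mathbf{r}$ and can be handed to a standard linear solver. A minimal sketch for the cyclic transition matrix above; the concrete rewards and $\gamma=0.9$ are illustrative assumptions:

```python
import numpy as np

gamma = 0.9                          # assumed discount factor
r = np.array([0.0, 1.0, 1.0, 1.0])   # assumed immediate rewards r_1..r_4
P = np.array([[0, 1, 0, 0],          # cyclic transitions s1->s2->s3->s4->s1
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)
```

The solution satisfies the fixed-point equation exactly (up to floating-point error), which is easy to verify by substituting it back in.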
【State value】
✨Single-step process
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$
- $t, t+1$: discrete time instants
- $S_t$: the state at time $t$
- $A_t$: the action taken in state $S_t$
- $R_{t+1}$: the reward obtained after taking action $A_t$
- $S_{t+1}$: the next state after taking action $A_t$
Each step is governed by the following probability distributions:
- $S_t \rightarrow A_t$: the policy $\pi\left(A_t=a \mid S_t=s\right)$
- $S_t, A_t \rightarrow R_{t+1}$: the reward probability $p\left(R_{t+1}=r \mid S_t=s, A_t=a\right)$
- $S_t, A_t \rightarrow S_{t+1}$: the state transition probability $p\left(S_{t+1}=s' \mid S_t=s, A_t=a\right)$
✨Multi-step process
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \ldots$$
- discounted return: $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots$
✨State value
$G_t$ is the discounted return of a single trajectory; the state value is the expectation of $G_t$:
$$v_\pi(s)=\mathbb{E}\left[G_t \mid S_t=s\right]$$
- $v_\pi(s)$ depends on the starting state $s$: different trajectories starting from $s$ yield different values of $G_t$.
- $v_\pi(s)$ depends on the policy $\pi$: different policies produce different trajectories, and hence different values of $G_t$.
- The state value is not just a number but a measure of worth: the larger the state value, the more return can be obtained.
Question: what is the difference between the return and the state value?
Answer:
The return is computed for a single trajectory.
The state value is the average of the returns over many trajectories.
If multiple trajectories can start from a state, the two are clearly different; if only one trajectory can start from a state, the two coincide.
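The averaging relationship can be seen by simulation: sampling many trajectories under a stochastic policy and averaging their discounted returns approximates the state value. A minimal sketch whose two-branch setup mimics policy 3 above; $\gamma=0.9$, the reward pattern, and the truncation length are illustrative assumptions:

```python
import random

gamma = 0.9
random.seed(0)

def sample_return(n_steps=100):
    # One trajectory: with prob 0.5 the first reward is -1, with prob 0.5 it is 0;
    # every later reward is +1 (the agent stays at the target).
    g = -1 if random.random() < 0.5 else 0
    discount = gamma
    for _ in range(n_steps):  # truncate the infinite sum; gamma^100 is negligible
        g += discount * 1
        discount *= gamma
    return g

returns = [sample_return() for _ in range(20000)]
estimate = sum(returns) / len(returns)
exact = -0.5 + gamma / (1 - gamma)  # = 8.5, the state value (mean over trajectories)
print(estimate, exact)
```

Each individual return is either $\approx 8$ or $\approx 9$; only their average converges to the state value $8.5$.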
【Derivation of the Bellman equation】
✨Definition:
The Bellman equation describes the relationship among the state values of different states.
✨Derivation:
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \ldots$$
Its return $G_t$ is:
$$\begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \\ &= R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \\ &= R_{t+1}+\gamma G_{t+1} \end{aligned}$$
The state value is then:
$$\begin{aligned} v_\pi(s) &= \mathbb{E}\left[G_t \mid S_t=s\right] \\ &= \mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\ &= \mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \end{aligned}$$
We analyze the two expectations separately:
- The first term, $\mathbb{E}\left[R_{t+1} \mid S_t=s\right]$, is the mean of the immediate reward:
$$\begin{aligned} \mathbb{E}\left[R_{t+1} \mid S_t=s\right] &= \sum_a \pi(a \mid s)\, \mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right] \\ &= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r \end{aligned}$$
- The second term, $\mathbb{E}\left[G_{t+1} \mid S_t=s\right]$, is the mean of the future rewards:
$$\begin{aligned} \mathbb{E}\left[G_{t+1} \mid S_t=s\right] &= \sum_{s'} \mathbb{E}\left[G_{t+1} \mid S_t=s, S_{t+1}=s'\right] p\left(s' \mid s\right) \\ &= \sum_{s'} \mathbb{E}\left[G_{t+1} \mid S_{t+1}=s'\right] p\left(s' \mid s\right) \quad \text{(by the Markov property)} \\ &= \sum_{s'} v_\pi\left(s'\right) p\left(s' \mid s\right) \\ &= \sum_{s'} v_\pi\left(s'\right) \sum_a p\left(s' \mid s, a\right) \pi(a \mid s) \end{aligned}$$
Putting the two terms together gives the Bellman equation, which holds for every state in the state space:
$$\begin{aligned} v_\pi(s) &= \mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\ &= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r}_{\text{mean of immediate rewards}}+\underbrace{\gamma \sum_a \pi(a \mid s) \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right)}_{\text{mean of future rewards}} \\ &= \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right)\right], \quad \forall s \in \mathcal{S}. \end{aligned}$$
✨Example 1:
We write out the Bellman equation for every state in the figure.
For $s_1$:
$$\begin{aligned} & \pi\left(a=a_3 \mid s_1\right)=1 \text{ and } \pi\left(a \neq a_3 \mid s_1\right)=0 \\ & p\left(s'=s_3 \mid s_1, a_3\right)=1 \text{ and } p\left(s' \neq s_3 \mid s_1, a_3\right)=0 \\ & p\left(r=0 \mid s_1, a_3\right)=1 \text{ and } p\left(r \neq 0 \mid s_1, a_3\right)=0 \end{aligned}$$
$$v_\pi\left(s_1\right)=0+\gamma v_\pi\left(s_3\right)$$
Similarly, for all four states:
$$\begin{aligned} & v_\pi\left(s_1\right)=0+\gamma v_\pi\left(s_3\right) \\ & v_\pi\left(s_2\right)=1+\gamma v_\pi\left(s_4\right) \\ & v_\pi\left(s_3\right)=1+\gamma v_\pi\left(s_4\right) \\ & v_\pi\left(s_4\right)=1+\gamma v_\pi\left(s_4\right) \end{aligned}$$
Solving from the last equation backward gives:
$$v_\pi\left(s_4\right)=\frac{1}{1-\gamma}, \quad v_\pi\left(s_3\right)=\frac{1}{1-\gamma}, \quad v_\pi\left(s_2\right)=\frac{1}{1-\gamma}, \quad v_\pi\left(s_1\right)=\frac{\gamma}{1-\gamma}$$
Substituting $\gamma=0.9$ gives:
$$v_\pi\left(s_4\right)=\frac{1}{1-0.9}=10, \quad v_\pi\left(s_3\right)=\frac{1}{1-0.9}=10, \quad v_\pi\left(s_2\right)=\frac{1}{1-0.9}=10, \quad v_\pi\left(s_1\right)=\frac{0.9}{1-0.9}=9$$
A state with a higher value is more valuable: more return can be obtained starting from it.
✨Example 2:
Its Bellman equations are:
$$\begin{aligned} & v_\pi\left(s_1\right)=0.5\left[0+\gamma v_\pi\left(s_3\right)\right]+0.5\left[-1+\gamma v_\pi\left(s_2\right)\right] \\ & v_\pi\left(s_2\right)=1+\gamma v_\pi\left(s_4\right) \\ & v_\pi\left(s_3\right)=1+\gamma v_\pi\left(s_4\right) \\ & v_\pi\left(s_4\right)=1+\gamma v_\pi\left(s_4\right) \end{aligned}$$
Solving gives:
$$\begin{aligned} v_\pi\left(s_4\right) &= \frac{1}{1-\gamma}, \quad v_\pi\left(s_3\right)=\frac{1}{1-\gamma}, \quad v_\pi\left(s_2\right)=\frac{1}{1-\gamma} \\ v_\pi\left(s_1\right) &= 0.5\left[0+\gamma v_\pi\left(s_3\right)\right]+0.5\left[-1+\gamma v_\pi\left(s_2\right)\right] \\ &= -0.5+\frac{\gamma}{1-\gamma} \end{aligned}$$
Substituting $\gamma=0.9$ gives:
$$v_\pi\left(s_4\right)=10, \quad v_\pi\left(s_3\right)=10, \quad v_\pi\left(s_2\right)=10, \quad v_\pi\left(s_1\right)=-0.5+9=8.5$$
The state value of $s_1$ is 8.5, so this policy is worse than the previous one.
【Matrix-vector form of the Bellman equation】
$$v_\pi(s)=\sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right)\right]$$
Rewriting the Bellman equation gives:
$$v_\pi(s)=r_\pi(s)+\gamma \sum_{s'} p_\pi\left(s' \mid s\right) v_\pi\left(s'\right)$$
- $r_\pi(s) \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$: the mean of the immediate rewards
- $p_\pi\left(s' \mid s\right) \triangleq \sum_a \pi(a \mid s)\, p\left(s' \mid s, a\right)$: the probability of moving from $s$ to $s'$ under policy $\pi$
Indexing the states as $s_i\ (i=1, \ldots, n)$, the Bellman equation becomes:
$$v_\pi\left(s_i\right)=r_\pi\left(s_i\right)+\gamma \sum_{s_j} p_\pi\left(s_j \mid s_i\right) v_\pi\left(s_j\right)$$
In matrix-vector form:
$$v_\pi=r_\pi+\gamma P_\pi v_\pi$$
- $v_\pi=\left[v_\pi\left(s_1\right), \ldots, v_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$
- $r_\pi=\left[r_\pi\left(s_1\right), \ldots, r_\pi\left(s_n\right)\right]^T \in \mathbb{R}^n$
- $P_\pi \in \mathbb{R}^{n \times n}$, where $\left[P_\pi\right]_{ij}=p_\pi\left(s_j \mid s_i\right)$, is the state transition matrix
With four states, $v_\pi=r_\pi+\gamma P_\pi v_\pi$ can be written as:
$$\underbrace{\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}}_{v_\pi}=\underbrace{\begin{bmatrix} r_\pi\left(s_1\right) \\ r_\pi\left(s_2\right) \\ r_\pi\left(s_3\right) \\ r_\pi\left(s_4\right) \end{bmatrix}}_{r_\pi}+\gamma \underbrace{\begin{bmatrix} p_\pi\left(s_1 \mid s_1\right) & p_\pi\left(s_2 \mid s_1\right) & p_\pi\left(s_3 \mid s_1\right) & p_\pi\left(s_4 \mid s_1\right) \\ p_\pi\left(s_1 \mid s_2\right) & p_\pi\left(s_2 \mid s_2\right) & p_\pi\left(s_3 \mid s_2\right) & p_\pi\left(s_4 \mid s_2\right) \\ p_\pi\left(s_1 \mid s_3\right) & p_\pi\left(s_2 \mid s_3\right) & p_\pi\left(s_3 \mid s_3\right) & p_\pi\left(s_4 \mid s_3\right) \\ p_\pi\left(s_1 \mid s_4\right) & p_\pi\left(s_2 \mid s_4\right) & p_\pi\left(s_3 \mid s_4\right) & p_\pi\left(s_4 \mid s_4\right) \end{bmatrix}}_{P_\pi} \underbrace{\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}}_{v_\pi}$$
✨Example 1:
$$\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}=\begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}+\gamma\begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}$$
✨Example 2:
$$\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}=\begin{bmatrix} 0.5(0)+0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix}+\gamma\begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v_\pi\left(s_1\right) \\ v_\pi\left(s_2\right) \\ v_\pi\left(s_3\right) \\ v_\pi\left(s_4\right) \end{bmatrix}$$
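Example 2's system can be solved directly with a linear solver. A minimal sketch, assuming $\gamma=0.9$ as in the worked example; it should recover $v_\pi(s_1)=8.5$ and $v_\pi(s_2)=v_\pi(s_3)=v_\pi(s_4)=10$:

```python
import numpy as np

gamma = 0.9  # assumed discount factor, matching the worked example
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])  # mean immediate rewards
P_pi = np.array([[0, 0.5, 0.5, 0],   # under pi, s1 moves to s2 or s3 with prob 0.5
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

# (I - gamma * P_pi) v_pi = r_pi
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)  # approximately [8.5, 10, 10, 10]
```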
【Solving for state values with the Bellman equation】
Policy evaluation is a key tool in reinforcement learning: only by evaluating how good a policy is can we improve it and eventually find the optimal policy.
$$v_\pi=r_\pi+\gamma P_\pi v_\pi$$
✨Solution method 1 (closed-form solution):
$$v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi$$
This solves the equation directly, but it requires inverting a matrix, which is expensive when the state space is large, so it is rarely used in practice.
✨Solution method 2 (iterative solution):
$$v_{k+1}=r_\pi+\gamma P_\pi v_k$$
The iterates converge to the true state values:
$$v_k \rightarrow v_\pi=\left(I-\gamma P_\pi\right)^{-1} r_\pi, \quad k \rightarrow \infty$$
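The iteration takes only a few lines. A minimal sketch reusing Example 2's data; $\gamma=0.9$ and the all-zero starting guess $v_0=0$ are illustrative assumptions (convergence holds for any starting guess, since the map is a $\gamma$-contraction):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([-0.5, 1.0, 1.0, 1.0])  # mean immediate rewards (Example 2)
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

v_next = v = np.zeros(4)                 # arbitrary initial guess v_0
for k in range(1000):
    v_next = r_pi + gamma * P_pi @ v     # v_{k+1} = r_pi + gamma * P_pi v_k
    if np.max(np.abs(v_next - v)) < 1e-10:  # stop when the update is tiny
        break
    v = v_next

closed_form = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_next, closed_form)  # both approximately [8.5, 10, 10, 10]
```

The iterative and closed-form answers agree to within the stopping tolerance, illustrating the convergence claim above.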
✨Example 1:
These two policies are good: although they are different policies, they produce the same final result (the same state values).
✨Example 2:
These two policies are bad: both yield negative state values.
We can therefore compute state values to evaluate whether a policy is good or bad.
【Action value】
- state value: the average return the agent obtains starting from a state
- action value: the average return the agent obtains starting from a state and then taking a particular action
In reinforcement learning, a policy specifies which action to take in each state; action values tell us which actions are worth choosing.
✨Definition:
$$q_\pi(s, a)=\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]$$
- $q_\pi(s, a)$ depends on the starting state $s$ and the chosen action $a$; it also depends on the policy $\pi$.
✨Relationship between state value and action value:
$$\underbrace{\mathbb{E}\left[G_t \mid S_t=s\right]}_{v_\pi(s)}=\sum_a \underbrace{\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]}_{q_\pi(s, a)} \pi(a \mid s)$$
$$v_\pi(s)=\sum_a \pi(a \mid s)\, q_\pi(s, a)$$
The state value is the average of the action values over all actions, weighted by the policy.
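The weighted-average relation $v_\pi(s)=\sum_a \pi(a \mid s)\, q_\pi(s, a)$ is easy to check numerically. A minimal sketch; the specific policy probabilities, next-state values, and $\gamma=0.9$ are illustrative assumptions:

```python
gamma = 0.9  # assumed discount factor

# Illustrative assumption: in state s1 the policy takes a2 or a3 with prob 0.5 each,
# and each action value follows q = r + gamma * v(s') for a deterministic step.
v_s2, v_s3 = 10.0, 10.0
q_s1 = {"a2": -1 + gamma * v_s2, "a3": 0 + gamma * v_s3}
pi_s1 = {"a2": 0.5, "a3": 0.5}

# v(s) = sum_a pi(a|s) * q(s, a)
v_s1 = sum(pi_s1[a] * q_s1[a] for a in pi_s1)
print(v_s1)  # 8.5
```

This reproduces the $v_\pi(s_1)=-0.5+\gamma/(1-\gamma)=8.5$ result from Example 2 above.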
From the Bellman equation derived earlier:
$$v_\pi(s)=\sum_a \pi(a \mid s)\underbrace{\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right)\right]}_{q_\pi(s, a)}$$
we obtain the expression for the action value:
$$q_\pi(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right)$$
✨Example:
The action value of $s_1$ under the policy's action: $q_\pi\left(s_1, a_2\right)=-1+\gamma v_\pi\left(s_2\right)$
Question: what are $q_\pi\left(s_1, a_1\right), q_\pi\left(s_1, a_3\right), q_\pi\left(s_1, a_4\right), q_\pi\left(s_1, a_5\right)$?
Answer: although the current policy tells us to move right, that action is not necessarily the best; in fact the action value of every action can be computed:
$$\begin{aligned} & q_\pi\left(s_1, a_1\right)=-1+\gamma v_\pi\left(s_1\right) \\ & q_\pi\left(s_1, a_3\right)=0+\gamma v_\pi\left(s_3\right) \\ & q_\pi\left(s_1, a_4\right)=-1+\gamma v_\pi\left(s_1\right) \\ & q_\pi\left(s_1, a_5\right)=0+\gamma v_\pi\left(s_1\right) \end{aligned}$$
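With deterministic rewards and transitions, as in the grid example, the action-value formula collapses to $q_\pi(s,a)=r+\gamma v_\pi(s')$. A minimal sketch; the state values and the (reward, next-state) pairs are illustrative assumptions, not values given in the text:

```python
gamma = 0.9  # assumed discount factor

# Assumed state values for illustration (e.g. obtained by solving a Bellman equation).
v_pi = {"s1": 8.5, "s2": 10.0, "s3": 10.0}

# Deterministic actions at s1: (immediate reward, next state).
actions_at_s1 = {
    "a1": (-1, "s1"),  # bump into the boundary, stay in s1
    "a2": (-1, "s2"),
    "a3": (0, "s3"),
    "a5": (0, "s1"),   # stay put
}

def q(reward, next_state):
    # q_pi(s, a) = r + gamma * v_pi(s') for a deterministic reward and transition
    return reward + gamma * v_pi[next_state]

for a, (r, s_next) in actions_at_s1.items():
    print(a, q(r, s_next))
```

Comparing the printed action values shows which actions are worth taking at $s_1$, even when the current policy prescribes a different one.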