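- Two concepts: the optimal state value and the optimal policy
- One tool: the Bellman optimality equation (BOE)
【Example: how to improve a policy】
The Bellman equations of the given policy are: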
$$
\begin{aligned}
& v_\pi(s_1) = -1 + \gamma v_\pi(s_2), \\
& v_\pi(s_2) = +1 + \gamma v_\pi(s_4), \\
& v_\pi(s_3) = +1 + \gamma v_\pi(s_4), \\
& v_\pi(s_4) = +1 + \gamma v_\pi(s_4).
\end{aligned}
$$
With $\gamma = 0.9$, solving these equations gives

$$
v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = 10, \quad v_\pi(s_1) = 8
$$
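The four Bellman equations can be checked numerically. The sketch below is only an illustration: the dictionary `POLICY_STEP` and the function `evaluate_policy` are names made up here, and the (reward, next state) pairs are read directly off the equations. It repeatedly applies $v \leftarrow r + \gamma v(\text{next})$ until the values settle.

```python
GAMMA = 0.9

# Under the given policy, each state maps to (immediate reward, next state).
# These pairs are read off the four Bellman equations in the text.
POLICY_STEP = {
    "s1": (-1.0, "s2"),
    "s2": (1.0, "s4"),
    "s3": (1.0, "s4"),
    "s4": (1.0, "s4"),
}


def evaluate_policy(step, gamma, sweeps=500):
    """Iterate v <- r + gamma * v(next state) for a fixed number of sweeps."""
    v = {s: 0.0 for s in step}
    for _ in range(sweeps):
        v = {s: r + gamma * v[nxt] for s, (r, nxt) in step.items()}
    return v


state_values = evaluate_policy(POLICY_STEP, GAMMA)
print({s: round(val, 4) for s, val in state_values.items()})
```

The number of sweeps is an arbitrary choice; since each sweep shrinks the error by a factor of $\gamma$, 500 sweeps is far more than this small example needs.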
Next, compute the action values at $s_1$:

$$
\begin{aligned}
& q_\pi(s_1, a_1) = -1 + \gamma v_\pi(s_1) = 6.2, \\
& q_\pi(s_1, a_2) = -1 + \gamma v_\pi(s_2) = 8, \\
& q_\pi(s_1, a_3) = 0 + \gamma v_\pi(s_3) = 9, \\
& q_\pi(s_1, a_4) = -1 + \gamma v_\pi(s_1) = 6.2, \\
& q_\pi(s_1, a_5) = 0 + \gamma v_\pi(s_1) = 7.2.
\end{aligned}
$$
Question: the current policy is not good. How can we improve it?
Answer: use the action values. The current policy at $s_1$ is

$$
\pi(a \mid s_1) = \begin{cases} 1 & a = a_2 \\ 0 & a \neq a_2 \end{cases}
$$

Its action values are

$$
\begin{aligned}
& q_\pi(s_1, a_1) = 6.2, \quad q_\pi(s_1, a_2) = 8, \quad q_\pi(s_1, a_3) = 9, \\
& q_\pi(s_1, a_4) = 6.2, \quad q_\pi(s_1, a_5) = 7.2.
\end{aligned}
$$

If we select the action with the largest action value, $a^* = \arg\max_a q_\pi(s_1, a) = a_3$, we obtain a new policy (move down):

$$
\pi_{\text{new}}(a \mid s_1) = \begin{cases} 1 & a = a^* \\ 0 & a \neq a^* \end{cases}
$$

The new policy is indeed better: $a_3$ has an action value of 9, whereas the action $a_2$ chosen by the current policy has an action value of 8.
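The improvement step can be sketched in a few lines of Python. The names `S1_TRANSITIONS`, `action_values`, and `new_policy_s1` are illustrative only; the rewards and next states are taken from the action-value expressions in the example.

```python
GAMMA = 0.9

# State values of the current policy, as computed in the example.
v_pi = {"s1": 8.0, "s2": 10.0, "s3": 10.0, "s4": 10.0}

# For each action at s1: (immediate reward, next state), as in the example.
S1_TRANSITIONS = {
    "a1": (-1.0, "s1"),
    "a2": (-1.0, "s2"),
    "a3": (0.0, "s3"),
    "a4": (-1.0, "s1"),
    "a5": (0.0, "s1"),
}


def action_values(transitions, v, gamma):
    """q(s, a) = r + gamma * v(next state) for a deterministic model."""
    return {a: r + gamma * v[nxt] for a, (r, nxt) in transitions.items()}


q_s1 = action_values(S1_TRANSITIONS, v_pi, GAMMA)

# Greedy improvement: put all probability on the action with the largest q.
best_action = max(q_s1, key=q_s1.get)
new_policy_s1 = {a: 1.0 if a == best_action else 0.0 for a in q_s1}
print(q_s1, best_action)
```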
【Definition of an optimal policy】
State values can be used to judge whether one policy is better than another. Policy $\pi_1$ is better than $\pi_2$ if

$$
v_{\pi_1}(s) \geq v_{\pi_2}(s) \quad \text{for all } s \in \mathcal{S}
$$

✨Definition:
A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \geq v_\pi(s)$ for all $s$ and for every other policy $\pi$.
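As a small illustration of this comparison, the sketch below checks the inequality state by state. The helper `at_least_as_good` is a hypothetical name, and `v_new` assumes the improved policy differs from the old one only at $s_1$ (where it picks $a_3$), so that $v(s_1) = 0 + 0.9 \times 10 = 9$.

```python
def at_least_as_good(v1, v2, tol=1e-9):
    """Return True if v1(s) >= v2(s) for every state s."""
    return all(v1[s] >= v2[s] - tol for s in v1)


# State values of the original policy from the example.
v_old = {"s1": 8.0, "s2": 10.0, "s3": 10.0, "s4": 10.0}

# Assumed values of the improved policy: only s1 changes (action a3).
v_new = {"s1": 0.0 + 0.9 * 10.0, "s2": 10.0, "s3": 10.0, "s4": 10.0}

print(at_least_as_good(v_new, v_old))  # new policy versus old policy
print(at_least_as_good(v_old, v_new))  # old policy versus new policy
```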
【Bellman optimality equation (BOE)】
The Bellman equation:

$$
v(s) = \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\right), \quad \forall s \in \mathcal{S}
$$
The Bellman optimality equation puts $\max_\pi$ in front of the sum over $\pi$, so an optimization problem is nested inside the equation:

$$
\begin{aligned}
v(s) & = \max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\right), \quad \forall s \in \mathcal{S} \\
& = \max_\pi \sum_a \pi(a \mid s)\, q(s, a), \quad s \in \mathcal{S}
\end{aligned}
$$
- $p(r \mid s, a)$, $p(s' \mid s, a)$: known
- $v(s)$, $v(s')$: unknown, to be solved for
Matrix-vector form:

$$
v = \max_\pi \left(r_\pi + \gamma P_\pi v\right)
$$
✨The optimization problem on the right-hand side of the BOE:

$$
\begin{aligned}
v(s) & = \max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\right), \quad \forall s \in \mathcal{S} \\
& = \max_\pi \sum_a \pi(a \mid s)\, q(s, a)
\end{aligned}
$$
To solve it, consider a simpler problem first. Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given. Find $c_1^*, c_2^*, c_3^*$ that solve

$$
\max_{c_1, c_2, c_3} \; c_1 q_1 + c_2 q_2 + c_3 q_3
$$

- subject to $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \geq 0$ (the $c_i$ play the role of probabilities)

Suppose $q_3 \geq q_1, q_2$ (that is, $q_3$ is the largest). Then the optimal solution is $c_3^* = 1$ and $c_1^* = c_2^* = 0$.
- Intuition: when $q_3$ is the largest, all the weight should go to $q_3$ so that the weighted sum is as large as possible.
- Mathematically: for any feasible $c_1, c_2, c_3$, $q_3 = (c_1 + c_2 + c_3)\, q_3 = c_1 q_3 + c_2 q_3 + c_3 q_3 \geq c_1 q_1 + c_2 q_2 + c_3 q_3$.
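The claim can be sanity-checked numerically. The sketch below is a brute-force grid search over the probability simplex, not a proof; the function name `best_simplex_weights` and the values in `q` are made up for illustration.

```python
def best_simplex_weights(q, steps=20):
    """Grid-search (c1, c2, c3) with c_i >= 0 and c1 + c2 + c3 = 1
    to maximize c1*q1 + c2*q2 + c3*q3."""
    best_value, best_c = float("-inf"), None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            c = (i / steps, j / steps, (steps - i - j) / steps)
            value = sum(ci * qi for ci, qi in zip(c, q))
            if value > best_value:
                best_value, best_c = value, c
    return best_value, best_c


q = (1.0, 2.0, 5.0)  # hypothetical values with q3 the largest
value, weights = best_simplex_weights(q)
print(value, weights)
```

On this grid the best weighted sum equals $\max_i q_i$ and all the weight sits on the third entry, which matches the argument above.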
The same argument applies to the BOE. Since $\pi(a \mid s) \geq 0$ and $\sum_a \pi(a \mid s) = 1$, we have

$$
\max_\pi \sum_a \pi(a \mid s)\, q(s, a) = \max_{a \in \mathcal{A}(s)} q(s, a)
$$

The maximum is attained by the following policy, where $a^* = \arg\max_a q(s, a)$:

$$
\pi(a \mid s) = \begin{cases} 1 & a = a^* \\ 0 & a \neq a^* \end{cases}
$$
✨Rewriting the Bellman optimality equation:
Define

$$
f(v) := \max_\pi \left(r_\pi + \gamma P_\pi v\right)
$$

The Bellman optimality equation then becomes

$$
v = f(v)
$$

where the entry of $f(v)$ corresponding to state $s$ is

$$
[f(v)]_s = \max_\pi \sum_a \pi(a \mid s)\, q(s, a), \quad s \in \mathcal{S}
$$
✨Contraction mapping theorem (Banach fixed-point theorem):
【Concepts】
- Fixed point: $x \in X$ is a fixed point of a function $f: X \rightarrow X$ if $f(x) = x$.
- Contraction mapping: a function $f$ is a contraction mapping if, for all $x_1, x_2$,
$$
\left\|f(x_1) - f(x_2)\right\| \leq \gamma \left\|x_1 - x_2\right\|
$$
- $\gamma \in (0, 1)$
- $\|\cdot\|$: can be any vector norm
【Example 1】
$$
x = f(x) = 0.5x, \quad x \in \mathbb{R}.
$$
- $x = 0$ is a fixed point, since $0 = 0.5 \cdot 0$.
- $f(x)$ is also a contraction mapping: $\left\|0.5 x_1 - 0.5 x_2\right\| = 0.5 \left\|x_1 - x_2\right\| \leq \gamma \left\|x_1 - x_2\right\|$ for any $\gamma \in [0.5, 1)$.
【Example 2 (vector form)】
$$
x = f(x) = Ax, \quad \text{where } x \in \mathbb{R}^n,\ A \in \mathbb{R}^{n \times n} \text{ and } \|A\| \leq \gamma < 1.
$$
- $x = 0$ is again a fixed point, since $0 = A0$.
- $f(x)$ is also a contraction mapping: $\left\|A x_1 - A x_2\right\| = \left\|A(x_1 - x_2)\right\| \leq \|A\| \left\|x_1 - x_2\right\| \leq \gamma \left\|x_1 - x_2\right\|$.
【Contraction mapping theorem】
For any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then:
- Existence: there exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
- Uniqueness: the fixed point $x^*$ is unique.
- Algorithm: the sequence $\{x_k\}$ generated by $x_{k+1} = f(x_k)$ converges to the fixed point, that is, $x_k \rightarrow x^*$ as $k \rightarrow \infty$.
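The iteration $x_{k+1} = f(x_k)$ is easy to try out. The sketch below is a minimal scalar version; the function name `fixed_point_iterate`, the stopping tolerance, and the second map $f(x) = 0.5x + 1$ are assumptions added for illustration (only $f(x) = 0.5x$ comes from the examples above).

```python
def fixed_point_iterate(f, x0, tol=1e-12, max_iter=10_000):
    """Apply x <- f(x) until successive iterates differ by less than tol."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")


# Example 1 from the text: f(x) = 0.5 x, whose fixed point is x* = 0.
x_star, n_iter = fixed_point_iterate(lambda x: 0.5 * x, x0=10.0)

# A hypothetical second contraction: f(x) = 0.5 x + 1, with fixed point x* = 2.
y_star, _ = fixed_point_iterate(lambda x: 0.5 * x + 1.0, x0=-7.0)

print(x_star, n_iter, y_star)
```

Both maps shrink distances by a factor of 0.5, so the iterates approach the fixed point regardless of the starting value chosen here.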
✨Solving the Bellman optimality equation:
The function $f$ on the right-hand side of the BOE is a contraction mapping, so the contraction mapping theorem applies and the solution can be computed iteratively:

$$
v_{k+1} = f(v_k) = \max_\pi \left(r_\pi + \gamma P_\pi v_k\right)
$$

Elementwise:

$$
\begin{aligned}
v_{k+1}(s) & = \max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')\right) \\
& = \max_\pi \sum_a \pi(a \mid s)\, q_k(s, a) \\
& = \max_a q_k(s, a)
\end{aligned}
$$
【Summary of the procedure (value iteration algorithm)】
- For each state $s$, start from the current estimate $v_k(s)$.
- For every action $a \in \mathcal{A}(s)$, compute
$$
q_k(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')
$$
- Compute the greedy policy $\pi_{k+1}$, where $a_k^*(s) = \arg\max_a q_k(s, a)$:
$$
\pi_{k+1}(a \mid s) = \begin{cases} 1 & a = a_k^*(s) \\ 0 & a \neq a_k^*(s) \end{cases}
$$
- Update the value estimate:
$$
v_{k+1}(s) = \max_a q_k(s, a)
$$
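The steps above can be sketched on a tiny grid world. Everything about the environment below is an assumption chosen to be consistent with the earlier example: a 2x2 grid (`s1 s2 / s3 s4`) with `s2` forbidden and `s4` the target, actions `a1..a5` meaning up, right, down, left, stay, and rewards of -1 for hitting the boundary or landing in the forbidden cell, +1 for landing in the target, and 0 otherwise. The function names are illustrative.

```python
GAMMA = 0.9

# Hypothetical 2x2 grid layout (row, column); s2 is forbidden, s4 is the target.
POSITIONS = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 0), "s4": (1, 1)}
STATE_AT = {pos: s for s, pos in POSITIONS.items()}
# a1..a5 = up, right, down, left, stay, as (row offset, column offset).
ACTIONS = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}
FORBIDDEN, TARGET = "s2", "s4"


def step(state, action):
    """Deterministic model: return (reward, next_state)."""
    row, col = POSITIONS[state]
    d_row, d_col = ACTIONS[action]
    nxt = STATE_AT.get((row + d_row, col + d_col))
    if nxt is None:          # bumped into the boundary: stay put, reward -1
        return -1.0, state
    if nxt == FORBIDDEN:     # landed in (or stayed in) the forbidden cell
        return -1.0, nxt
    if nxt == TARGET:        # landed in (or stayed in) the target cell
        return 1.0, nxt
    return 0.0, nxt


def q_values(v, gamma):
    """q_k(s, a) = r + gamma * v_k(s') for every state-action pair."""
    q = {}
    for s in POSITIONS:
        q[s] = {}
        for a in ACTIONS:
            reward, nxt = step(s, a)
            q[s][a] = reward + gamma * v[nxt]
    return q


def value_iteration(gamma, tol=1e-10):
    v = {s: 0.0 for s in POSITIONS}
    while True:
        q = q_values(v, gamma)
        policy = {s: max(q[s], key=q[s].get) for s in POSITIONS}  # greedy policy
        v_next = {s: q[s][policy[s]] for s in POSITIONS}          # v_{k+1} = max_a q_k
        if max(abs(v_next[s] - v[s]) for s in POSITIONS) < tol:
            return v_next, policy
        v = v_next


v_star, pi_star = value_iteration(GAMMA)
print({s: round(x, 4) for s, x in v_star.items()}, pi_star)
```

Under these assumed rewards the greedy action at `s1` is `a3` (move down), in line with the policy-improvement example.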
✨Optimality of the BOE solution:
Suppose $v^*$ is the solution of the Bellman optimality equation and $\pi^*$ is the policy that attains the maximum for $v^*$:

$$
\begin{aligned}
& v^* = \max_\pi \left(r_\pi + \gamma P_\pi v^*\right) \\
& \pi^* = \arg\max_\pi \left(r_\pi + \gamma P_\pi v^*\right) \\
& v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*
\end{aligned}
$$

The last line is the Bellman equation of $\pi^*$, so $v^*$ is the state value of $\pi^*$.
The optimal policy $\pi^*$ is the greedy policy, with $a^*(s) = \arg\max_a q(s, a)$ evaluated at $v = v^*$:

$$
\pi^*(a \mid s) = \begin{cases} 1 & a = a^*(s) \\ 0 & a \neq a^*(s) \end{cases}
$$
【Analyzing optimal policies】
The known quantities (marked in red in the original) are used to solve for the unknown ones (marked in black):
- Reward design: $r$
- System model: $p(s' \mid s, a)$, $p(r \mid s, a)$
- Discount rate: $\gamma$
- Unknowns to be solved for: $v(s)$, $v(s')$, $\pi(a \mid s)$
✨Choosing $\gamma$:
A large $\gamma$ makes the policy far-sighted; a small $\gamma$ makes it short-sighted.
✨Choosing $r$:
Question: if every reward is transformed as $r \rightarrow ar + b$, does the optimal policy change? For example, change

$$
r_{\text{boundary}} = r_{\text{forbidden}} = -1, \quad r_{\text{target}} = 1, \quad r_{\text{otherstep}} = 0
$$

to

$$
r_{\text{boundary}} = r_{\text{forbidden}} = 0, \quad r_{\text{target}} = 2, \quad r_{\text{otherstep}} = 1
$$

Answer: no, provided $a > 0$ (here $a = 1$, $b = 1$). What matters is the relative values of the action values, not their absolute values.
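A short derivation may make this concrete. It is a sketch under two assumptions: $a > 0$, and the same map $r \rightarrow ar + b$ is applied to every reward. Let $v^*$ solve the original BOE and set $v' = a v^* + \frac{b}{1-\gamma}\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector (notation introduced here). The transformed reward vector is $a r_\pi + b\mathbf{1}$, and since every row of $P_\pi$ sums to one, $P_\pi \mathbf{1} = \mathbf{1}$. Therefore

$$
\begin{aligned}
\max_\pi \left(a r_\pi + b\mathbf{1} + \gamma P_\pi v'\right)
& = a \max_\pi \left(r_\pi + \gamma P_\pi v^*\right) + b\mathbf{1} + \frac{\gamma b}{1-\gamma}\mathbf{1} \\
& = a v^* + \frac{b}{1-\gamma}\mathbf{1} = v'.
\end{aligned}
$$

So $v'$ solves the BOE with the transformed rewards, and because $a > 0$ the maximizing policy is the same as before. With $a = 1$, $b = 1$ and, for instance, $\gamma = 0.9$, every optimal state value simply shifts by $b/(1-\gamma) = 10$.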
✨Meaningless detours:
Question: moving from one white cell to another incurs no penalty, so will the optimal policy take meaningless detours?
Answer: no. A detour delays reaching the target, and because of discounting the rewards collected are smaller.
Policy (a): $\text{return} = 1 + \gamma 1 + \gamma^2 1 + \cdots = 1/(1-\gamma) = 10$
Policy (b): $\text{return} = 0 + \gamma 0 + \gamma^2 1 + \gamma^3 1 + \cdots = \gamma^2/(1-\gamma) = 8.1$
It is tempting to add a penalty for every wasted step. But even without such a penalty, reaching the target later reduces the discounted reward obtained there, so the agent finds the shortest path on its own. In this sense the two designs are equivalent.
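The two returns can be reproduced with a truncated sum. The sketch below assumes $\gamma = 0.9$ (the value that yields 10 and 8.1) and cuts the infinite series off after `HORIZON` steps; `discounted_return` and `HORIZON` are names chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


GAMMA = 0.9
HORIZON = 1000  # truncation length; the discarded tail is negligible for gamma = 0.9

# Policy (a): reaches the target immediately and collects +1 at every step.
rewards_a = [1.0] * HORIZON
# Policy (b): takes a two-step detour (reward 0) before collecting +1 at every step.
rewards_b = [0.0, 0.0] + [1.0] * (HORIZON - 2)

return_a = discounted_return(rewards_a, GAMMA)
return_b = discounted_return(rewards_b, GAMMA)
print(round(return_a, 4), round(return_b, 4))
```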