
  • Two concepts: the optimal state value and the optimal policy
  • One tool: the Bellman optimality equation (BOE)



$$
\begin{aligned}
& v_\pi(s_1)=-1+\gamma v_\pi(s_2), \\
& v_\pi(s_2)=+1+\gamma v_\pi(s_4), \\
& v_\pi(s_3)=+1+\gamma v_\pi(s_4), \\
& v_\pi(s_4)=+1+\gamma v_\pi(s_4).
\end{aligned}
$$
Assuming $\gamma=0.9$, solving these equations gives
$$
v_\pi(s_4)=v_\pi(s_3)=v_\pi(s_2)=10, \quad v_\pi(s_1)=8.
$$
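The state values above can be checked numerically. The following is a minimal sketch, assuming the deterministic transitions read off the four equations (s1 to s2, s2 to s4, s3 to s4, s4 to s4); the names `policy_step` and `evaluate_policy` are illustrative and not from the original post.

```python
GAMMA = 0.9

# (immediate reward, next state) under the given policy, read off the equations.
policy_step = {
    "s1": (-1.0, "s2"),
    "s2": (1.0, "s4"),
    "s3": (1.0, "s4"),
    "s4": (1.0, "s4"),
}


def evaluate_policy(step, gamma, iterations=500):
    """Repeatedly apply v(s) <- r + gamma * v(s') to all states at once."""
    v = {s: 0.0 for s in step}
    for _ in range(iterations):
        v = {s: r + gamma * v[nxt] for s, (r, nxt) in step.items()}
    return v


v_pi = evaluate_policy(policy_step, GAMMA)
print({s: round(x, 4) for s, x in v_pi.items()})
```

The iteration count is an arbitrary choice; with $\gamma=0.9$ the error shrinks by a factor of 0.9 per sweep, so 500 sweeps is far more than needed.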

Next we compute the action values at $s_1$:
$$
\begin{aligned}
& q_\pi(s_1, a_1)=-1+\gamma v_\pi(s_1)=6.2, \\
& q_\pi(s_1, a_2)=-1+\gamma v_\pi(s_2)=8, \\
& q_\pi(s_1, a_3)=0+\gamma v_\pi(s_3)=9, \\
& q_\pi(s_1, a_4)=-1+\gamma v_\pi(s_1)=6.2, \\
& q_\pi(s_1, a_5)=0+\gamma v_\pi(s_1)=7.2.
\end{aligned}
$$
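The five action values follow mechanically from the state values. A small sketch, assuming the (reward, next state) pairs implied by the equations above; `outcomes_s1` is an illustrative name.

```python
GAMMA = 0.9
v_pi = {"s1": 8.0, "s2": 10.0, "s3": 10.0, "s4": 10.0}

# (immediate reward, next state) for each action taken in s1,
# as implied by the five equations above.
outcomes_s1 = {
    "a1": (-1.0, "s1"),
    "a2": (-1.0, "s2"),
    "a3": (0.0, "s3"),
    "a4": (-1.0, "s1"),
    "a5": (0.0, "s1"),
}

# q(s1, a) = r + gamma * v(s')
q_s1 = {a: r + GAMMA * v_pi[nxt] for a, (r, nxt) in outcomes_s1.items()}
best_action = max(q_s1, key=q_s1.get)
print({a: round(q, 2) for a, q in q_s1.items()}, best_action)
```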


Answer: use action values to improve the policy. The current policy at $s_1$ is:
$$
\pi(a \mid s_1)= \begin{cases}1 & a=a_2 \\ 0 & a \neq a_2\end{cases}
$$
Its action values are:
$$
\begin{aligned}
& q_\pi(s_1, a_1)=6.2, \quad q_\pi(s_1, a_2)=8, \quad q_\pi(s_1, a_3)=9, \\
& q_\pi(s_1, a_4)=6.2, \quad q_\pi(s_1, a_5)=7.2.
\end{aligned}
$$
If we pick the action with the largest action value, $a^*=\arg\max_a q_\pi(s_1, a)=a_3$, we get a new policy (move down):
$$
\pi_{\text{new}}(a \mid s_1)= \begin{cases}1 & a=a^* \\ 0 & a \neq a^*\end{cases}
$$
Choosing $a_3$ is indeed better: with the rest of the policy unchanged, $v(s_1)=0+\gamma v_\pi(s_3)=9$, up from 8.


State values can be used to judge how good a policy is. If the following holds, then $\pi_1$ is better than (no worse than) $\pi_2$:
$$
v_{\pi_1}(s) \geq v_{\pi_2}(s) \quad \text{for all } s \in \mathcal{S}
$$


A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \geq v_\pi(s)$ for all $s$ and for every other policy $\pi$.


Bellman equation (for a given policy $\pi$):
$$
v(s)=\sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s'} p(s' \mid s, a) v(s')\right), \quad \forall s \in \mathcal{S}
$$
Bellman optimality equation: a $\max_\pi$ is placed in front of $\pi$, which nests an optimization problem inside the equation:
$$
\begin{aligned}
v(s) & =\max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s'} p(s' \mid s, a) v(s')\right), \quad \forall s \in \mathcal{S} \\
& =\max_\pi \sum_a \pi(a \mid s) q(s, a), \quad \forall s \in \mathcal{S}
\end{aligned}
$$

  • $p(r \mid s, a),\ p(s' \mid s, a)$: known
  • $v(s),\ v(s')$: unknown, to be computed

$$
v=\max_\pi\left(r_\pi+\gamma P_\pi v\right)
$$


$$
\begin{aligned}
v(s) & =\max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s'} p(s' \mid s, a) v(s')\right), \quad \forall s \in \mathcal{S} \\
& =\max_\pi \sum_a \pi(a \mid s) q(s, a)
\end{aligned}
$$

Solving the inner maximization: suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given, and we want the $c_1^*, c_2^*, c_3^*$ that solve
$$
\max_{c_1, c_2, c_3}\ c_1 q_1+c_2 q_2+c_3 q_3
$$

  • subject to $c_1+c_2+c_3=1$ and $c_1, c_2, c_3 \geq 0$ (the $c_i$ play the role of probabilities)

Suppose $q_3 \geq q_1, q_2$ (that is, $q_3$ is the largest return). Then the optimal solution is $c_3^*=1$ and $c_1^*=c_2^*=0$.

  • Intuition: when $q_3$ is the largest, all the weight should go to $q_3$ to maximize the weighted sum.
  • Mathematically: for any feasible $c_1, c_2, c_3$, $q_3=(c_1+c_2+c_3) q_3=c_1 q_3+c_2 q_3+c_3 q_3 \geq c_1 q_1+c_2 q_2+c_3 q_3$
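The claim that a one-hot weighting attains the maximum can be sanity-checked by sampling. This is only an illustrative sketch: the values in `q` are made up, and random sampling is a spot check, not a proof.

```python
import random


def weighted_sum(c, q):
    """The objective c1*q1 + c2*q2 + c3*q3."""
    return sum(ci * qi for ci, qi in zip(c, q))


def random_simplex_point(n, rng):
    """Draw nonnegative weights and normalize them to sum to 1."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


q = [6.2, 8.0, 9.0]          # illustrative values; q3 is the largest
rng = random.Random(0)

# Best objective value found among many random feasible weightings.
best_sampled = max(
    weighted_sum(random_simplex_point(len(q), rng), q) for _ in range(10_000)
)

# Put all the weight on the largest q.
best_index = q.index(max(q))
one_hot = [1.0 if i == best_index else 0.0 for i in range(len(q))]

print(best_sampled, weighted_sum(one_hot, q))
```

No sampled weighting should exceed the one-hot value, which equals $\max_i q_i$.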

Therefore, since $\sum_a \pi(a \mid s)=1$, we obtain
$$
\max_\pi \sum_a \pi(a \mid s) q(s, a)=\max_{a \in \mathcal{A}(s)} q(s, a),
$$
and the maximum is attained by the greedy policy, where $a^*=\arg\max_a q(s, a)$:
$$
\pi(a \mid s)= \begin{cases}1 & a=a^* \\ 0 & a \neq a^*\end{cases}
$$


$$
f(v):=\max_\pi\left(r_\pi+\gamma P_\pi v\right)
$$

The Bellman optimality equation then becomes $v=f(v)$, where the entry of $f(v)$ for state $s$ is
$$
[f(v)]_s=\max_\pi \sum_a \pi(a \mid s) q(s, a), \quad s \in \mathcal{S}
$$


  • Fixed point: $x \in X$ is a fixed point of a function $f: X \rightarrow X$ if $f(x)=x$.

  • Contraction mapping: a function $f$ is a contraction mapping if, for all $x_1, x_2$,


    $$\|f(x_1)-f(x_2)\| \leq \gamma\|x_1-x_2\|$$

    • $\gamma \in(0,1)$
    • $\|\cdot\|$: can be any vector norm

$$
x=f(x)=0.5 x, \quad x \in \mathbb{R}.
$$

  • $x=0$ is a fixed point.
  • $f(x)$ is also a contraction mapping: $\|0.5 x_1-0.5 x_2\|=0.5\|x_1-x_2\| \leq \gamma\|x_1-x_2\|$ for any $\gamma \in[0.5,1)$.

$$
x=f(x)=A x, \quad \text{where } x \in \mathbb{R}^n,\ A \in \mathbb{R}^{n \times n} \text{ and } \|A\| \leq \gamma<1.
$$

  • $x=0$ is again a fixed point, since $0=A 0$.
  • $f(x)$ is again a contraction mapping: $\|A x_1-A x_2\|=\|A(x_1-x_2)\| \leq\|A\|\|x_1-x_2\| \leq \gamma\|x_1-x_2\|$.
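The contraction inequality for $f(x)=Ax$ can be spot-checked numerically. A minimal sketch, assuming the infinity norm (the norm is left unspecified above) and a hypothetical $2 \times 2$ matrix chosen so that $\|A\|_\infty<1$:

```python
def mat_vec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def inf_norm_vec(x):
    return max(abs(xi) for xi in x)


def inf_norm_mat(A):
    """Induced infinity norm: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


A = [[0.5, 0.2], [0.1, 0.6]]   # hypothetical matrix, both row sums are 0.7
gamma = inf_norm_mat(A)

x1, x2 = [3.0, -1.0], [-2.0, 4.0]
diff_after = [a - b for a, b in zip(mat_vec(A, x1), mat_vec(A, x2))]
diff_before = [a - b for a, b in zip(x1, x2)]

lhs = inf_norm_vec(diff_after)           # ||A x1 - A x2||
rhs = gamma * inf_norm_vec(diff_before)  # gamma * ||x1 - x2||
print(lhs, "<=", rhs)
```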

Contraction mapping theorem: for any equation of the form $x=f(x)$, if $f$ is a contraction mapping, then

  • Existence: there exists a fixed point $x^*$ satisfying $f(x^*)=x^*$.
  • Uniqueness: the fixed point $x^*$ is unique.
  • Algorithm: the sequence $\{x_k\}$ generated by $x_{k+1}=f(x_k)$ satisfies $x_k \rightarrow x^*$ as $k \rightarrow \infty$.
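The iteration $x_{k+1}=f(x_k)$ can be sketched in a few lines. The first call uses $f(x)=0.5x$ from the example above; the second, $f(x)=0.5x+1$ with fixed point 2, is an extra illustration not taken from the original. The stopping tolerance and starting points are arbitrary choices.

```python
def fixed_point_iterate(f, x0, tol=1e-12, max_iter=1000):
    """Apply x <- f(x) until successive iterates differ by less than tol."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter


# f(x) = 0.5 x: the unique fixed point is 0.
x_star, steps = fixed_point_iterate(lambda x: 0.5 * x, x0=10.0)

# f(x) = 0.5 x + 1: also a contraction, with fixed point x* = 2.
y_star, _ = fixed_point_iterate(lambda x: 0.5 * x + 1.0, x0=-7.0)

print(x_star, steps, y_star)
```

Changing `x0` should not change the limit, which is what uniqueness of the fixed point predicts.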


Since $f(v)$ in the Bellman optimality equation is a contraction mapping, the contraction mapping theorem can be used to solve it iteratively:
$$
v_{k+1}=f(v_k)=\max_\pi\left(r_\pi+\gamma P_\pi v_k\right)
$$

Elementwise:
$$
\begin{aligned}
v_{k+1}(s) & =\max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a) r+\gamma \sum_{s'} p(s' \mid s, a) v_k(s')\right) \\
& =\max_\pi \sum_a \pi(a \mid s) q_k(s, a) \\
& =\max_a q_k(s, a)
\end{aligned}
$$

  1. For each state $s$, start from the current estimate $v_k(s)$.

  2. For every action $a \in \mathcal{A}(s)$, compute
    $$q_k(s, a)=\sum_r p(r \mid s, a) r+\gamma \sum_{s'} p(s' \mid s, a) v_k(s')$$

  3. Compute the greedy policy $\pi_{k+1}$, where $a_k^*(s)=\arg\max_a q_k(s, a)$:
    $$\pi_{k+1}(a \mid s)= \begin{cases}1 & a=a_k^*(s) \\ 0 & a \neq a_k^*(s)\end{cases}$$

  4. Update the value: $v_{k+1}(s)=\max_a q_k(s, a)$
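The four steps above can be sketched as a short program. The grid world here is an assumption: a 2x2 grid with s1 and s2 in the top row, s3 and s4 in the bottom row, s2 forbidden and s4 the target, with actions a1 to a5 meaning up, right, down, left, stay. This layout is consistent with the rewards in the earlier equations but is not spelled out in the text; all function names are illustrative.

```python
GAMMA = 0.9

# Assumed 2x2 grid as (row, col): s1 s2 on top, s3 s4 below.
POS = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 0), "s4": (1, 1)}
STATE_AT = {pos: s for s, pos in POS.items()}
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}
FORBIDDEN, TARGET = "s2", "s4"   # assumption, see the lead-in


def step(s, a):
    """Deterministic model: return (reward, next_state) for taking a in s."""
    row, col = POS[s]
    d_row, d_col = MOVES[a]
    nxt = STATE_AT.get((row + d_row, col + d_col))
    if nxt is None:              # hit the boundary: stay in place
        return -1.0, s
    if nxt == FORBIDDEN:
        return -1.0, nxt
    if nxt == TARGET:
        return 1.0, nxt
    return 0.0, nxt


def q_values(v, s, gamma):
    """Step 2: q_k(s, a) for every action a."""
    q = {}
    for a in MOVES:
        reward, nxt = step(s, a)
        q[a] = reward + gamma * v[nxt]
    return q


def value_iteration(gamma=GAMMA, tol=1e-10):
    v = {s: 0.0 for s in POS}                                # step 1: initial guess
    while True:
        q = {s: q_values(v, s, gamma) for s in POS}
        policy = {s: max(q[s], key=q[s].get) for s in POS}   # step 3: greedy policy
        v_next = {s: q[s][policy[s]] for s in POS}           # step 4: v_{k+1}
        if max(abs(v_next[s] - v[s]) for s in POS) < tol:
            return v_next, policy
        v = v_next


v_star, pi_star = value_iteration()
print({s: round(x, 3) for s, x in v_star.items()}, pi_star)
```

Under these assumptions the iteration should settle at $v(s_1)=9$ and $v(s_2)=v(s_3)=v(s_4)=10$, with the greedy action at $s_1$ being $a_3$, which matches the improved policy found earlier.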


Suppose $v^*$ is the solution of the Bellman optimality equation and $\pi^*$ is the corresponding optimal policy:
$$
\begin{aligned}
& v^*=\max_\pi\left(r_\pi+\gamma P_\pi v^*\right) \\
& \pi^*=\arg\max_\pi\left(r_\pi+\gamma P_\pi v^*\right) \\
& v^*=r_{\pi^*}+\gamma P_{\pi^*} v^*
\end{aligned}
$$
The last line is the Bellman equation of $\pi^*$, so $v^*$ is the state value of $\pi^*$. The policy $\pi^*$ is greedy, with $a^*(s)=\arg\max_a q^*(s, a)$:
$$
\pi^*(a \mid s)= \begin{cases}1 & a=a^*(s) \\ 0 & a \neq a^*(s)\end{cases}
$$




  • Reward design: $r$
  • System model: $p(s' \mid s, a),\ p(r \mid s, a)$
  • Discount rate: $\gamma$
  • $v(s),\ v(s'),\ \pi(a \mid s)$: the unknowns to be solved for

Choosing $\gamma$:

A large $\gamma$ makes the policy far-sighted; a small $\gamma$ makes it short-sighted.
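A toy comparison makes the effect of $\gamma$ concrete. The two reward sequences below are hypothetical and not taken from the original: a "shortcut" that pays $-1$ once (crossing a forbidden cell) and then $+1$ forever, and a "detour" that pays $0$ for three steps and then $+1$ forever.

```python
def discounted_return(prefix, tail_reward, gamma):
    """Return of a reward sequence: a finite prefix, then tail_reward forever."""
    head = sum(r * gamma**t for t, r in enumerate(prefix))
    tail = tail_reward * gamma ** len(prefix) / (1.0 - gamma)
    return head + tail


def preferred_route(gamma):
    # Hypothetical routes: immediate penalty vs. a longer wait for the reward.
    shortcut = discounted_return([-1.0], 1.0, gamma)
    detour = discounted_return([0.0, 0.0, 0.0], 1.0, gamma)
    return "shortcut" if shortcut > detour else "detour"


for gamma in (0.9, 0.5):
    print(gamma, preferred_route(gamma))
```

With $\gamma=0.9$ the later rewards outweigh the immediate penalty, so the shortcut wins; with $\gamma=0.5$ the immediate penalty dominates and the detour wins.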




Choosing $r$:



Question: if every reward is transformed as $r \rightarrow a r+b$, does the optimal policy change? For example, replace
$$
r_{\text{boundary}}=r_{\text{forbidden}}=-1, \quad r_{\text{target}}=1, \quad r_{\text{otherstep}}=0
$$
with (here $a=1$, $b=1$)
$$
r_{\text{boundary}}=r_{\text{forbidden}}=0, \quad r_{\text{target}}=2, \quad r_{\text{otherstep}}=1
$$

Answer: no, the optimal policy does not change (for $a>0$). What matters is the relative size of the action values, not their absolute values.
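This invariance can be checked on a small example. The sketch below assumes the same 2x2 grid layout used for illustration elsewhere (s2 forbidden, s4 target; a1 to a5 meaning up, right, down, left, stay), encoded as a table of (reward kind, next state); the table and the function `solve` are illustrative, not from the original.

```python
GAMMA = 0.9

# MODEL[s][a] = (kind of reward, next state); assumed 2x2 grid.
MODEL = {
    "s1": {"a1": ("boundary", "s1"), "a2": ("forbidden", "s2"), "a3": ("other", "s3"),
           "a4": ("boundary", "s1"), "a5": ("other", "s1")},
    "s2": {"a1": ("boundary", "s2"), "a2": ("boundary", "s2"), "a3": ("target", "s4"),
           "a4": ("other", "s1"), "a5": ("forbidden", "s2")},
    "s3": {"a1": ("other", "s1"), "a2": ("target", "s4"), "a3": ("boundary", "s3"),
           "a4": ("boundary", "s3"), "a5": ("other", "s3")},
    "s4": {"a1": ("forbidden", "s2"), "a2": ("boundary", "s4"), "a3": ("boundary", "s4"),
           "a4": ("other", "s3"), "a5": ("target", "s4")},
}

REWARDS_ORIGINAL = {"boundary": -1.0, "forbidden": -1.0, "target": 1.0, "other": 0.0}
REWARDS_SHIFTED = {"boundary": 0.0, "forbidden": 0.0, "target": 2.0, "other": 1.0}


def solve(rewards, gamma=GAMMA, sweeps=1000):
    """Value iteration, then read off the greedy policy."""
    def q(v, s, a):
        kind, nxt = MODEL[s][a]
        return rewards[kind] + gamma * v[nxt]

    v = {s: 0.0 for s in MODEL}
    for _ in range(sweeps):
        v = {s: max(q(v, s, a) for a in MODEL[s]) for s in MODEL}
    policy = {s: max(MODEL[s], key=lambda a: q(v, s, a)) for s in MODEL}
    return v, policy


v1, policy1 = solve(REWARDS_ORIGINAL)
v2, policy2 = solve(REWARDS_SHIFTED)
print(policy1 == policy2, {s: round(v2[s] - v1[s], 3) for s in MODEL})
```

Both reward settings should give the same greedy policy, while every state value shifts by the same constant $b/(1-\gamma)=10$.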





With $\gamma=0.9$:
Policy (a): return $=1+\gamma \cdot 1+\gamma^2 \cdot 1+\cdots=1/(1-\gamma)=10$
Policy (b): return $=0+\gamma \cdot 0+\gamma^2 \cdot 1+\gamma^3 \cdot 1+\cdots=\gamma^2/(1-\gamma)=8.1$
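Both closed-form returns can be confirmed by summing a long truncated reward sequence. A minimal sketch; the horizon of 2000 steps is an arbitrary cutoff that makes the neglected tail negligible for $\gamma=0.9$.

```python
GAMMA = 0.9
HORIZON = 2000   # arbitrary cutoff; gamma**2000 is vanishingly small


def truncated_return(rewards, gamma):
    """Discounted sum of a finite reward sequence."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Policy (a): reward 1 at every step.
return_a = truncated_return([1.0] * HORIZON, GAMMA)

# Policy (b): two steps with reward 0, then reward 1 at every step.
return_b = truncated_return([0.0, 0.0] + [1.0] * HORIZON, GAMMA)

print(round(return_a, 6), round(return_b, 6))
```

The two-step delay costs policy (b) a factor of $\gamma^2$, which is why its return is lower.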




