Lect3_Dynamic_Programming

Planning by Dynamic Programming

Introduction

Dynamic: a sequential or temporal component to the problem
Programming: optimising a "program"

Requirements for DP

  1. Optimal substructure: the problem can be decomposed into subproblems, and combining the optimal solutions of the subproblems yields an optimal solution to the overall problem
  2. Overlapping subproblems: the same subproblems recur many times, so their solutions can be cached and reused

MDPs satisfy both requirements:

  1. The Bellman equation gives a recursive decomposition
  2. The value function stores and reuses solutions

DP is used for planning in an MDP:

Prediction: given an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$, output the value function $\operatorname{v}_\pi$.

Control: given an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, output the optimal value function $\operatorname{v}_*$ and an optimal policy $\pi_*$.

Policy Evaluation


Iterative Policy Evaluation

Iterative application of the [Bellman expectation backup](#Bellman expectation backup): $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_\pi$

  • Using synchronous backups:
    • at each iteration $k+1$
    • for all states $s \in \mathcal{S}$
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$
    • where $s'$ is a successor state of $s$
  • Convergence to $\operatorname{v}_\pi$ can be proved

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
This is reminiscent of fixed-point iteration in numerical analysis for solving $x = f(x)$: initialize $x_0$ to some number, then loop $x_{k+1} = f(x_k)$.
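
As a tiny illustration of that analogy (the choice of $f(x) = \cos x$ and the starting point are arbitrary, purely for demonstration):

```python
import math

# Fixed-point iteration x_{k+1} = f(x_k) for f(x) = cos(x).
# The iterates converge to the fixed point x* satisfying x* = cos(x*) ≈ 0.739.
f = math.cos
x = 1.0  # arbitrary initial guess x_0
for k in range(100):
    x_next = f(x)
    if abs(x_next - x) < 1e-10:  # stop once successive iterates agree
        break
    x = x_next
print(x)  # ≈ 0.7390851332
```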

Matrix form:
$$\mathbf{v}^{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi \mathbf{v}^k$$
where
$$\begin{aligned} \mathcal{R}_s^\pi &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{R}_s^a \\ \mathcal{P}_{ss'}^\pi &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{P}_{ss'}^a \end{aligned}$$
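
A minimal NumPy sketch of this matrix-form iteration; the two-state $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ below are made up purely for illustration, not taken from the lecture:

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, tol=1e-8):
    """Iterate v_{k+1} = R^pi + gamma * P^pi @ v_k until convergence."""
    v = np.zeros(len(R_pi))                   # v_0(s) = 0 for every state
    while True:
        v_next = R_pi + gamma * P_pi @ v      # synchronous backup of all states
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

# Hypothetical 2-state MDP under some fixed policy pi:
P_pi = np.array([[0.5, 0.5],                  # P^pi[s, s'] = sum_a pi(a|s) P^a_{ss'}
                 [0.2, 0.8]])
R_pi = np.array([1.0, -1.0])                  # R^pi[s] = sum_a pi(a|s) R^a_s
print(policy_evaluation(P_pi, R_pi))
```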


Example

(Figure: the 4×4 small gridworld used in this example; the shaded corner squares are the terminal state, shown twice.)

  • $\gamma = 1$
  • Nonterminal states 1, …, 14; one terminal state (shown twice as shaded squares)
  • Actions leading out of the grid leave the state unchanged, e.g. when $s = 4,\ a = \text{west}$, the next state is $s' = 4$
  • Transitions are deterministic given the action, e.g. $\mathcal{P}_{62}^{\text{north}} = \mathcal{P} \left[s' = 2 \mid s=6, a=\text{north}\right] = 1$
  • Reward is $-1$ until the terminal state is reached
  • Uniform random policy: $\pi(\text{n} \mid \cdot) = \pi(\text{e} \mid \cdot) = \pi(\text{w} \mid \cdot) = \pi(\text{s} \mid \cdot) = 0.25$

Initialize the value function of every state to 0, then keep iterating, as shown below:

(Figure: the value function $\operatorname{v}_k$ on the small gridworld after successive iterations of policy evaluation under the random policy.)

The calculation proceeds as follows:

For $k=0$:
$$\operatorname{v}_0(s) = 0 \qquad \forall s$$
For $k = 1$, e.g.:
$$\begin{aligned} \operatorname{v}_1(s{=}4) ={}& \pi(a{=}\text{n} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{n}} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=\text{n}}\operatorname{v}_0(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{w}} + \mathcal{P}_{s=4,s'=4}^{a=\text{w}} \operatorname{v}_0(s'{=}4) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{s}} + \mathcal{P}_{s=4,s'=8}^{a=\text{s}}\operatorname{v}_0(s'{=}8) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{e}} + \mathcal{P}_{s=4,s'=5}^{a=\text{e}}\operatorname{v}_0(s'{=}5) \right) \\ ={}& 0.25(-1+0)+0.25(-1+0)+0.25(-1+0)+0.25(-1+0) = -1.0 \end{aligned}$$
For $k = 2$, e.g.:
$$\begin{aligned} \operatorname{v}_2(s{=}4) ={}& \pi(a{=}\text{n} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{n}} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=\text{n}}\operatorname{v}_1(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{w}} + \mathcal{P}_{s=4,s'=4}^{a=\text{w}} \operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{s}} + \mathcal{P}_{s=4,s'=8}^{a=\text{s}}\operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{e}} + \mathcal{P}_{s=4,s'=5}^{a=\text{e}}\operatorname{v}_1(s'{=}5) \right) \\ ={}& 0.25(-1+0)+0.25(-1-1)+0.25(-1-1)+0.25(-1-1) = -1.75 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_2(s{=}8) ={}& \pi(a{=}\text{n} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{n}} + \mathcal{P}_{s=8,s'=4}^{a=\text{n}}\operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{w}} + \mathcal{P}_{s=8,s'=8}^{a=\text{w}} \operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{s}} + \mathcal{P}_{s=8,s'=12}^{a=\text{s}}\operatorname{v}_1(s'{=}12) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{e}} + \mathcal{P}_{s=8,s'=9}^{a=\text{e}}\operatorname{v}_1(s'{=}9) \right) \\ ={}& 0.25(-1-1)+0.25(-1-1)+0.25(-1-1)+0.25(-1-1) = -2 \end{aligned}$$
For $k=3$, e.g.:
$$\begin{aligned} \operatorname{v}_3(s{=}4) ={}& \pi(a{=}\text{n} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{n}} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=\text{n}}\operatorname{v}_2(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{w}} + \mathcal{P}_{s=4,s'=4}^{a=\text{w}} \operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{s}} + \mathcal{P}_{s=4,s'=8}^{a=\text{s}}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=\text{e}} + \mathcal{P}_{s=4,s'=5}^{a=\text{e}}\operatorname{v}_2(s'{=}5) \right) \\ ={}& 0.25(-1+0)+0.25(-1-1.75)+0.25(-1-2)+0.25(-1-2) = -2.4375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}8) ={}& \pi(a{=}\text{n} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{n}} + \mathcal{P}_{s=8,s'=4}^{a=\text{n}}\operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{w}} + \mathcal{P}_{s=8,s'=8}^{a=\text{w}} \operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{s}} + \mathcal{P}_{s=8,s'=12}^{a=\text{s}}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=\text{e}} + \mathcal{P}_{s=8,s'=9}^{a=\text{e}}\operatorname{v}_2(s'{=}9) \right) \\ ={}& 0.25(-1-1.75)+0.25(-1-2)+0.25(-1-2)+0.25(-1-2) = -2.9375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}12) ={}& \pi(a{=}\text{n} \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=\text{n}} + \mathcal{P}_{s=12,s'=8}^{a=\text{n}}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}\text{w} \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=\text{w}} + \mathcal{P}_{s=12,s'=12}^{a=\text{w}} \operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}\text{s} \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=\text{s}} + \mathcal{P}_{s=12,s'=12}^{a=\text{s}}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}\text{e} \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=\text{e}} + \mathcal{P}_{s=12,s'=13}^{a=\text{e}}\operatorname{v}_2(s'{=}13) \right) \\ ={}& 0.25(-1-2)+0.25(-1-2)+0.25(-1-2)+0.25(-1-2) = -3.0 \end{aligned}$$
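
These hand calculations can be checked with a short script. The following is a sketch of the small gridworld under the uniform random policy (cells numbered row by row, with the terminal state in the two shaded corners as described above); it reproduces $\operatorname{v}_1(4)=-1.0$, $\operatorname{v}_2(4)=-1.75$, $\operatorname{v}_3(4)=-2.4375$ and $\operatorname{v}_3(12)=-3.0$:

```python
import numpy as np

# 4x4 gridworld, cells numbered row by row 0..15; cells 0 and 15 are the single
# terminal state (shown twice), cells 1..14 are the nonterminal states.
N = 16
TERMINAL = {0, 15}
MOVES = {'n': -4, 's': 4, 'w': -1, 'e': 1}

def step(s, a):
    """Deterministic transition; actions leading off the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == 'n' and row == 0) or (a == 's' and row == 3) \
       or (a == 'w' and col == 0) or (a == 'e' and col == 3):
        return s
    return s + MOVES[a]

def backup(v):
    """One synchronous Bellman expectation backup: gamma = 1, reward -1 per step,
    uniform random policy pi(a|s) = 0.25."""
    v_next = np.zeros(N)
    for s in range(N):
        if s in TERMINAL:
            continue
        v_next[s] = sum(0.25 * (-1.0 + v[step(s, a)]) for a in MOVES)
    return v_next

v = np.zeros(N)
for k in range(1, 4):
    v = backup(v)
    print(f"k={k}: v(4)={v[4]:.4f}  v(8)={v[8]:.4f}  v(12)={v[12]:.4f}")
# k=1: v(4)=-1.0000  v(8)=-1.0000  v(12)=-1.0000
# k=2: v(4)=-1.7500  v(8)=-2.0000  v(12)=-2.0000
# k=3: v(4)=-2.4375  v(8)=-2.9375  v(12)=-3.0000
```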

Policy Iteration

(Figure: policy iteration alternates policy evaluation and greedy policy improvement until both the policy and the value function converge to $\pi^*$ and $\operatorname{v}_*$.)

Algorithm:

  1. Given a policy $\pi$

  2. Loop until the stopping condition is met (see the sketch after this list):

    1. Evaluate the policy $\pi$
      $$\operatorname{v}_\pi(s) = \mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+ \ldots \mid S_t = s \right]$$

    2. Improve the policy by acting greedily with respect to $\operatorname{v}_\pi$
      $$\pi' = \operatorname{greedy}(\operatorname{v}_\pi)$$
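
A minimal sketch of this loop for a small deterministic MDP; the dictionary layout `P[s][a] = (reward, next_state)` and the toy MDP at the bottom are my own assumptions for illustration, not from the lecture:

```python
def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Policy iteration for a small deterministic MDP given as
    P[s][a] = (reward, next_state).  Hypothetical data layout for this sketch."""
    states = list(P)
    policy = {s: next(iter(P[s])) for s in states}      # arbitrary initial policy
    v = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: synchronous Bellman expectation backups to v_pi
        while True:
            v_new = {s: P[s][policy[s]][0] + gamma * v[P[s][policy[s]][1]]
                     for s in states}
            delta = max(abs(v_new[s] - v[s]) for s in states)
            v = v_new
            if delta < theta:
                break
        # 2. Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a)
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: P[s][a][0] + gamma * v[P[s][a][1]])
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                      # improvement stopped => pi is optimal
            return policy, v

# Tiny hypothetical MDP: staying in state 1 via 'stay' pays +5 per step.
P = {0: {'stay': (0.0, 0), 'go': (-1.0, 1)},
     1: {'go': (-1.0, 0), 'stay': (5.0, 1)}}
print(policy_iteration(P))
```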

Policy improvement

Proof that $\pi' \geq \pi$ when acting greedily:

  1. Consider a deterministic policy, $a = \pi(s)$

  2. Improve the policy by acting greedily
    $$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_\pi(s,a)$$

  3. This improves the value from any state $s$ over one step
    $$q_\pi\left(s,\pi'(s) \right) = \max_{a \in \mathcal{A}}\ q_\pi(s,a) \ {\color{red}\geq} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$$

  4. It therefore improves the value function, $\operatorname{v}_{\pi'}(s) \geq \operatorname{v}_\pi(s)$
    $$\begin{aligned} \operatorname{v}_\pi(s) &\leq q_\pi\left(s,\pi'(s) \right) = \mathbb{E}_{{\color{red}\pi'}} \left[R_{t+1} + \gamma \operatorname{v}_{{\color{blue}\pi}} \left(S_{t+1} \right) \mid S_t = s \right] \\ &\leq \mathbb{E}_{{\color{red}\pi'}} \left[R_{t+1} + \gamma q_{{\color{blue}\pi}} \left(S_{t+1}, \pi'(S_{t+1}) \right) \mid S_t = s \right] \qquad \text{by step 3, which holds}\ \forall \, s \\ &\leq \mathbb{E}_{{\color{red}\pi'}} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{{\color{blue}\pi}} \left(S_{t+2}, \pi'(S_{t+2}) \right) \mid S_t = s \right] \\ &\leq \dots \leq \mathbb{E}_{{\color{red}\pi'}} \left[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = \operatorname{v}_{\pi'}(s) \end{aligned}$$
    Why ${\color{red}\pi'}$ in the expectation but ${\color{blue}\pi}$ inside it?

    Go back to how the expectation over actions is taken in the value function:
    $$\operatorname{v}_\pi(s) = \mathbb{E}_\pi \left[R_{t+1} + \gamma \operatorname{v}_\pi(S_{t+1}) \mid S_t = s \right] = \sum_{a \in \mathcal{A}}{\color{red}\pi(a \mid s)} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_\pi(s') \right)$$
    The weights of this expectation are the action probabilities $\pi(a \mid s)$ of the policy being followed. Once we act greedily, the actions are chosen by $\pi'$, so the expectation is taken under ${\color{red}\pi'}$; but the value functions inside it have not been re-evaluated yet, so they are still ${\color{blue}\operatorname{v}_\pi}$ and ${\color{blue}q_\pi}$. That is why both ${\color{red}\pi'}$ and ${\color{blue}\pi}$ appear.

Proof of convergence to $\pi^*$:

  1. If improvements stop
    $$q_\pi\left(s,\pi'(s) \right) = \max_{a \in \mathcal{A}}\ q_\pi(s,a) \ {\color{red}=} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$$

  2. Then the Bellman optimality equation has been satisfied
    $$\operatorname{v}_\pi(s) = \max_{a \in \mathcal{A}}\ q_\pi(s,a)$$

  3. Therefore $\operatorname{v}_\pi(s) = \operatorname{v}_*(s) \quad \forall s \in \mathcal{S}$

  4. So $\pi$ is an optimal policy

Value Iteration

Principle of Optimality

A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $\operatorname{v}_\pi(s) = \operatorname{v}_*(s)$, if and only if

  • for any state $s'$ reachable from $s$,
  • $\pi$ achieves the optimal value from state $s'$, i.e. $\operatorname{v}_\pi(s') = \operatorname{v}_*(s')$

Deterministic Value Iteration

Compared with Iterative Policy Evaluation (see the Policy Evaluation section above), the difference between them is how the update is done: a max over actions instead of an expectation under $\pi$.

  • If we know the solution to the subproblems $\operatorname{v}_*(s')$

  • Then the solution $\operatorname{v}_*(s)$ can be found by a one-step lookahead
    $$\operatorname{v}_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_*(s') \right)$$

  • The idea of value iteration is to apply these updates iteratively

  • Intuition: start with final rewards and work backwards

Iterative application of the [Bellman optimality backup](#Bellman optimality backup): $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_*$

  • Using synchronous backups:
    • at each iteration $k+1$
    • for all states $s \in \mathcal{S}$
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$
  • Intermediate value functions may not correspond to any policy

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \max_{a \in \mathcal{A}}\ \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
Matrix form:
$$\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}}\left( \mathcal{R}^{\mathbf{a}} + \gamma \mathcal{P}^{\mathbf{a}} \mathbf{v}_k \right)$$
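
A minimal sketch of synchronous value iteration; the array layout (`P[a, s, s2]` for transition probabilities, `R[s, a]` for expected rewards) and the toy two-state MDP are assumptions made for this example:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Synchronous value iteration.
    P[a, s, s2] = transition probability, R[s, a] = expected immediate reward."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s2} P[a, s, s2] * v[s2]
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_new = q.max(axis=1)                      # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, q.argmax(axis=1)         # optimal values and a greedy policy
        v = v_new

# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],            # action 0
              [[0.0, 1.0], [1.0, 0.0]]])           # action 1
R = np.array([[ 0.0, -1.0],                        # R[s=0, a=0], R[s=0, a=1]
              [ 5.0, -1.0]])                       # R[s=1, a=0], R[s=1, a=1]
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)                             # state 1 is worth staying in
```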


A live demo

GridWorld: Dynamic Programming Demo

Summary of DP Algorithms

| Problem    | Bellman Equation                                         | Algorithm                   |
| ---------- | -------------------------------------------------------- | --------------------------- |
| Prediction | Bellman Expectation Equation                             | Iterative Policy Evaluation |
| Control    | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration            |
| Control    | Bellman Optimality Equation                              | Value Iteration             |

  • Algorithms are based on the state-value function $\operatorname{v}_\pi(s)$ or $\operatorname{v}_*(s)$
  • With $m$ actions and $n$ states, the complexity is $O(mn^2)$ per iteration: each of the $n$ states is backed up over $m$ actions and up to $n$ successor states