Dynamic Programming
- State-value function
- Action-value function
- Policy iteration
- Value iteration
1. State-value function
$$ \begin{aligned} V^\pi(s) &=E_{\pi}[r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+...|s_t=s]\\ &=E_{\pi}[r_{t+1}+\gamma V^{\pi}(s_{t+1})|s_t=s]\\ &=\sum_a \pi(a|s)\sum_{s'}P(s'|s,a)[R(s,a,s')+\gamma V^{\pi}(s')] \end{aligned} $$
where $R(s,a,s')=E[r_{t+1}\mid s_t=s,a_t=a,s_{t+1}=s']$ is the expected immediate reward when taking action $a$ in state $s$ leads to state $s'$. Writing $R(s,a)=\sum_{s'\in S}P(s' \mid s,a)R(s,a,s')$ for the expected immediate reward of taking action $a$ in state $s$, we obtain
$$ V^{\pi}(s)=E_{a\sim\pi(\cdot\mid s)}\big[R(s,a)+\gamma E_{s'\sim P(\cdot\mid s,a)}[V^{\pi}(s')]\big] $$
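To make the expectation form concrete, $V^{\pi}$ for a fixed policy can be computed by solving the linear system $V=R_{\pi}+\gamma P_{\pi}V$ exactly. A minimal sketch: the 2-state, 2-action MDP (`P`, `R`) and the policy `pi` below are made-up illustrative numbers, not from the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# A fixed stochastic policy pi(a|s), also made up for illustration.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Bellman expectation equation in matrix form: V = R_pi + gamma * P_pi V,
# where R_pi[s] = sum_a pi(a|s) R(s,a) and P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a).
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)

# Solve (I - gamma * P_pi) V = R_pi directly instead of iterating.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)
```

Solving the linear system is exact but costs $O(|S|^3)$; the iterative sweeps in Section 3 trade exactness for cheaper per-step cost.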
The Bellman optimality equation then gives:$$ \begin{aligned} V^*(s)&=\max_a \sum_{s'}P(s'|s,a)[R(s,a,s')+\gamma V^*(s')]\\ &=\max_a (R(s,a)+\gamma E_{s'\sim P(\cdot\mid s,a)}[V^*(s')]) \end{aligned} $$
2. Action-value function
$$ \begin{aligned} Q^{\pi}(s,a) &=E_{\pi}[r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+...\mid s_t=s,a_t=a]\\ &=R(s,a)+\gamma \sum_{s'}P(s'|s,a)V^{\pi}(s')\\ &=R(s,a)+\gamma E_{P(s'\mid s,a)}[V^{\pi}(s')] \end{aligned} $$
The optimal action-value function satisfies:$$ \begin{aligned} Q^*(s,a) &=\max_{\pi} Q^{\pi}(s,a)\\ &=R(s,a)+\gamma E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q^*(s',a')] \end{aligned} $$
The optimal policy is then:
- When the MDP's $P$ and $R$ are known, $\pi^{\ast}(s)=\arg\max_a \sum_{s'}P(s'\mid s,a)[R(s,a,s')+\gamma V^{\ast}(s')]$, a one-step lookahead on $V^{\ast}$
- When $P$ and $R$ are unknown, $\pi^{\ast}(s)=\arg\max_a Q^{\ast}(s,a)$
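The $Q^{\ast}$ fixed point can be found by repeatedly applying the optimality backup and then reading off the greedy policy, as in the second bullet above. A minimal sketch; the 2-state, 2-action MDP below is hypothetical:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Fixed-point iteration on the Bellman optimality equation for Q:
#   Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
Q = np.zeros_like(R)
for _ in range(500):
    Q = R + gamma * np.einsum('sat,t->sa', P, Q.max(axis=1))

# Greedy policy extraction: pi*(s) = argmax_a Q*(s,a)
pi_star = Q.argmax(axis=1)
print(Q, pi_star)
```

Because the backup is a $\gamma$-contraction, the error shrinks by a factor of at most $\gamma$ per iteration, so 500 sweeps are far more than enough here.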
3. Policy iteration
$$ \pi_1 \xrightarrow{E} V^{\pi_1},Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{\ast} \xrightarrow{E} V^{\pi^{\ast}},Q^{\pi^{\ast}} \xrightarrow{I} \pi^{\ast} $$
- Policy evaluation: estimate the value function under policy $\pi$ by repeated sweeps $$ V_{k+1}(s)=E_{a\sim\pi(\cdot\mid s)}\big[R(s,a)+\gamma E_{s'\sim P(\cdot\mid s,a)}[V_k(s')]\big] $$ which converges to $V^{\pi}$ (and hence gives $Q^{\pi}$).
- Policy improvement: $$ \pi_{k+1}(s)=\arg\max_a Q^{\pi_k}(s,a) $$ Alternating these two steps converges to the optimal policy; see [1] for the full procedure.
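The two alternating steps can be sketched in Python. The 2-state, 2-action MDP (`P`, `R`) below is hypothetical, and the evaluation step here solves the linear Bellman system exactly rather than iterating sweeps, which is a common variant for small state spaces:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, n_states = 0.9, 2

policy = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
    R_pi = R[np.arange(n_states), policy]
    P_pi = P[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily with respect to Q^pi.
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    new_policy = Q.argmax(axis=1)

    # Stop when the policy no longer changes.
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print(policy, V)
```

With exact evaluation and a finite MDP, each improvement step either strictly improves the policy or repeats it, so the loop terminates after finitely many iterations.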
4. Value iteration
Write the Bellman optimality equation directly as an iteration: $$ V_{k+1}(s)=\max_a \big[R(s,a)+\gamma E_{s'\sim P(\cdot\mid s,a)}[V_k(s')]\big] $$ The contraction mapping theorem guarantees convergence; see [1] for the full procedure.
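A minimal value-iteration sketch, again on a made-up 2-state, 2-action MDP (the numbers and the stopping threshold are illustrative assumptions):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
while True:
    # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * E[V(s')]]
    V_new = (R + gamma * np.einsum('sat,t->sa', P, V)).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```

Unlike policy iteration, there is no explicit policy during the sweeps; the greedy policy is extracted once from the converged $V$.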
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.