Markov Decision Process and Dynamic Programming
- Date: March 2019
- Material from Reinforcement Learning: An Introduction, 2nd edition, Richard S. Sutton;
- Code from dennybritz, with some modifications;
Abstract
The MDP is a common formalism for environments in RL, and DP is a method that is guaranteed to converge to the optimum for finite MDPs, with worst-case running time polynomial in the number of states and actions. The basic idea of DP is bootstrapping on the Bellman equation, i.e., learning a guess from a guess. DP repeatedly alternates policy evaluation and policy improvement, eventually converging to the optimal policy and value function. This pattern, first evaluating the policy (prediction) and then improving it, is in fact the backbone of many RL algorithms.
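A minimal Python sketch of this evaluate-then-improve loop (generalized policy iteration); `policy_eval` and `policy_improve` are hypothetical placeholders standing in for the DP routines developed below, not a fixed API:

```python
import numpy as np

def policy_iteration(policy, policy_eval, policy_improve):
    """Alternate policy evaluation and policy improvement until stable.

    policy_eval(policy)  -> V, the value function of the current policy
    policy_improve(V)    -> a policy that is greedy with respect to V
    (Both are hypothetical helpers; see the DP sections below.)
    """
    while True:
        V = policy_eval(policy)           # Prediction: estimate V_pi
        new_policy = policy_improve(V)    # Control: act greedily w.r.t. V
        if np.array_equal(new_policy, policy):
            return policy, V              # Policy stable => optimal
        policy = new_policy
```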
MDP Problem Setup
From the RL problem setup we know that the basic elements of RL are the Agent and the Environment. Environments come in many kinds, but most can be abstracted as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP);
MDPs are a mathematically idealized form of the reinforcement learning problem for which precise
theoretical statements can be made.
Key elements of an MDP: $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
Name | Expression |
---|---|
State transition matrix (a Markov matrix) | $P_{ss'}^a=\mathbb{P}(S_{t+1}=s' \mid S_t=s, A_t=a)$ |
Reward function | $R_{s}^a=\mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]$ |
Return (cumulative reward) | $G_t=\sum_{k=0}^\infty\gamma^k R_{t+1+k}$ |
Value function | $V_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s]$ |
Action-value function | $Q_\pi(s,a)=\mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]$ |
Policy | $\pi(a \mid s)=\mathbb{P}(A_t=a \mid S_t=s)$ |
Reward transition equation | $R_{t+1}=R(S_t,A_t,S_{t+1})$ |
State transition under a policy $\pi$ | $P_{ss'}^\pi=\mathbb{P}(S_{t+1}=s' \mid S_t=s)=\sum_{a}\pi(a \mid s)P_{ss'}^a$ |
Reward function under a policy $\pi$ at state $s$ | $R_{s}^\pi=\sum_{a}\pi(a \mid s)R_{s}^a$ |
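In dennybritz's code (following Gym's discrete environments), $\mathcal{P}$ and $\mathcal{R}$ are stored together in a nested dict `env.P`, where `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples. A minimal sketch with a hypothetical two-state MDP in that format:

```python
# A tiny two-state, two-action MDP in the dict format used by
# dennybritz's DP notebooks: P[s][a] = [(prob, next_state, reward, done)].
P = {
    0: {  # state 0
        0: [(1.0, 0, 0.0, False)],                       # action 0: stay put, no reward
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1: usually reach the goal
    },
    1: {  # state 1 (terminal)
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}

# The expected one-step reward R_s^a from the table above follows directly:
def expected_reward(P, s, a):
    return sum(prob * reward for prob, _, reward, _ in P[s][a])

print(expected_reward(P, 0, 1))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```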
Bellman Equation
The Bellman equation relates the value function at one time step to the value function at the next:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

$$G_t = \sum_{k=t+1}^{T}\gamma^{k-t-1}R_k = R_{t+1} + \gamma G_{t+1}$$
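Taking expectations of this recursion under a policy $\pi$ and expanding with the definitions from the table above yields the Bellman expectation equation for the state-value function:

$$V_\pi(s)=\mathbb{E}_\pi[R_{t+1}+\gamma G_{t+1} \mid S_t=s]=\mathbb{E}_\pi[R_{t+1}+\gamma V_\pi(S_{t+1}) \mid S_t=s]=R_{s}^\pi+\gamma\sum_{s'}P_{ss'}^\pi V_\pi(s')$$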