You can take the course here on Udacity; it is a fairly easy-to-follow tutorial and a more interesting, less painful introduction than just reading Sutton's book on its own.
(Sutton's book is very detailed, but honestly tiring to read; it works well as a companion reference.)
Markov Decision Processes
- Markov property means only the present matters.
- The rules are stationary.
Component | Notation | Description |
---|---|---|
STATE | S | the set of states |
MODEL | T(s, a, s') ~ Pr(s' \| s, a) | the transition model: the probability of landing in s' after executing action a in state s |
ACTION | A(s), A | the actions available in a given state, or a fixed set of actions |
REWARD | R(s), R(s, a), R(s, a, s') | a scalar value received for being in a state / for being in a state and taking an action / for being in a state, taking an action, and ending up in another state |
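To make the table concrete, here is a tiny, hypothetical MDP written out as plain Python data; the state names, probabilities, and rewards are invented purely for illustration (a sketch, not anything from the course):

```python
# A toy MDP matching the table above: STATES, ACTIONS, a MODEL T, and a REWARD R.
# All names and numbers are made up for illustration.

STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

# MODEL  T(s, a, s') ~ Pr(s' | s, a): state -> action -> {next state: probability}
T = {
    "s0":   {"left":  {"s0": 1.0},
             "right": {"s1": 0.8, "s0": 0.2}},
    "s1":   {"left":  {"s0": 0.8, "s1": 0.2},
             "right": {"goal": 0.8, "s1": 0.2}},
    "goal": {"left":  {"goal": 1.0},          # absorbing state
             "right": {"goal": 1.0}},
}

# REWARD  R(s): a scalar for being in a state
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

# sanity check: each Pr(. | s, a) should sum to 1
for s in STATES:
    for a in ACTIONS:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```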
Solution to MDP
Component | Notation | Description |
---|---|---|
POLICY | π(s) ~ a | a function that takes in a state and returns an action |
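A policy can be just as plain: below is a hypothetical mapping from states to actions, plus a short rollout that samples next states from a transition model. Everything here (states, probabilities, step count) is made up for illustration:

```python
import random

# A policy is just a mapping  state -> action  (π(s) ~ a).
T = {  # hypothetical two-state chain: state -> action -> {next state: probability}
    "s0":   {"go": {"goal": 0.9, "s0": 0.1}},
    "goal": {"go": {"goal": 1.0}},
}
policy = {"s0": "go", "goal": "go"}

def rollout(start, steps=5):
    """Follow the policy for a few steps, sampling next states from T."""
    s, path = start, [start]
    for _ in range(steps):
        a = policy[s]                              # π(s) returns an action
        nxt, probs = zip(*T[s][a].items())
        s = random.choices(nxt, weights=probs)[0]  # sample s' ~ Pr(. | s, a)
        path.append(s)
    return path

print(rollout("s0"))
```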
Stationarity of Preferences
U(·) stands for the utility of the sequence of rewards received for visiting states S0, S1, S2, …
if
U(S0, S1, S2, …) > U(S0, S1’, S2’, …)
then
U(S1, S2, …) > U(S1’, S2’, …)
With the second definition of utility, the discounted sum, the total is bounded by Rmax/(1-γ): a finite value captures the effect of an infinite sequence (infinitely many steps are taken, yet the total never reaches the bound, a bit like a singularity).
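Spelled out, using the discounted-sum definition of utility with discount factor 0 ≤ γ < 1 and Rmax the largest reward obtainable in any single state:

```latex
U(s_0, s_1, s_2, \dots)
  = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1
```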
Optimal Policy
The utility of starting from some state is not about that state yielding the most reward right away (immediate); it is about that state leading to the most utility in the long run (long term), i.e. delayed reward.
The optimal policy, then, is the policy that, starting from some state (usually the initial state), returns for every subsequent state the action that maximizes the expected utility.
At each state you look for the action that maximizes the expected utility of what follows; that action takes you into the next state s'. The utility of a state is therefore the reward you get for being in that state plus the discounted sum of all the rewards you will collect afterwards. This is the Bellman equation (written out below).
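Written out in the notation of the tables above (using the R(s) form of the reward), together with the optimal policy it induces:

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a} \sum_{s'} T(s, a, s')\, U(s')
```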
So in practice the utility of every state comes from the utility of the positive-reward states propagating outwards.
The Bellman equation above is not linear (because of the max), so the system of equations cannot be solved directly; rewritten as an update rule that is applied repeatedly (value iteration, sketched below), it can be, and the updates are guaranteed to converge in the end.
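A minimal sketch of that iterative update, value iteration, run on the same kind of toy MDP as in the earlier sketches; γ, the stopping threshold, and all the numbers are hypothetical:

```python
# Value iteration: repeatedly apply  U(s) <- R(s) + γ * max_a Σ_s' T(s,a,s') U(s')
# until the utilities stop changing.  The MDP below is the hypothetical one
# from the first sketch.

GAMMA = 0.9
STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]
T = {
    "s0":   {"left":  {"s0": 1.0},
             "right": {"s1": 0.8, "s0": 0.2}},
    "s1":   {"left":  {"s0": 0.8, "s1": 0.2},
             "right": {"goal": 0.8, "s1": 0.2}},
    "goal": {"left":  {"goal": 1.0}, "right": {"goal": 1.0}},
}
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}

def value_iteration(theta=1e-6):
    U = {s: 0.0 for s in STATES}                 # arbitrary initial utilities
    while True:
        delta = 0.0
        for s in STATES:
            # expected utility of each action: Σ_s' T(s,a,s') U(s')
            q = [sum(p * U[s2] for s2, p in T[s][a].items()) for a in ACTIONS]
            new_u = R[s] + GAMMA * max(q)        # Bellman update
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < theta:                        # utilities have (numerically) converged
            return U

def greedy_policy(U):
    """Extract π*(s) = argmax_a Σ_s' T(s,a,s') U(s')."""
    return {s: max(ACTIONS,
                   key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
            for s in STATES}

U = value_iteration()
print(U)                 # utilities propagate outwards from the positive-reward state
print(greedy_policy(U))  # e.g. "right" everywhere except the absorbing goal
```

The stopping test here is one common choice: iterate until the largest change in any U(s) falls below a small threshold, at which point the utilities are effectively the fixed point of the Bellman update.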