All content below comes from: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
S --a--> r --> S' --a'--> r' --> S'' (the reward is fed back only after the action has been taken; note the order)
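To make the ordering concrete, here is a minimal sketch of the agent-environment loop (not from the lecture; the toy dynamics and all names are illustrative), where the reward is only observed after the action is taken:

```python
# A minimal sketch of the agent-environment loop: the reward r arrives
# only *after* action a is taken. Toy 2-state dynamics, purely illustrative.
import random

def step(state, action):
    """Return (next_state, reward); hypothetical toy dynamics."""
    next_state = (state + action) % 2          # deterministic transition
    reward = 1.0 if next_state == 1 else 0.0   # reward depends on the outcome
    return next_state, reward

state = 0
for t in range(3):
    action = random.choice([0, 1])             # agent picks a in state S
    next_state, reward = step(state, action)   # r is only revealed here
    print(f"S={state} --a={action}--> r={reward} --> S'={next_state}")
    state = next_state
```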
There are also two key equations: the Bellman Expectation Equation and the Bellman Optimality Equation.
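For reference (value functions are defined later in the course), the standard state-value forms of these two equations are:

```latex
% Bellman Expectation Equation (state-value form, for a policy \pi)
v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]

% Bellman Optimality Equation
v_*(s) = \max_a \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a \right]
```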
Traditional RL takes the MDP as its object of study. The underlying assumption is that the environment is fully observable (the current state completely characterizes the process).
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
The defining property of a Markov decision process (the Markov property):
A state process is Markov if and only if: P[St+1 | St] = P[St+1 | S1, ..., St]
Markov process (Markov chain): A Markov Process (or Markov Chain) is a tuple <S, P>
S is a (finite) set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
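As a minimal sketch of this definition, the following samples an episode from a Markov Process <S, P> (the states and the matrix P here are illustrative assumptions, not taken from the slides):

```python
# Sampling from a Markov Process <S, P>: the next state depends
# only on the current state. States and probabilities are made up.
import random

# P[s][s'] = probability of moving from s to s'; each row sums to 1.
P = {
    "Class1": {"Class1": 0.1, "Class2": 0.7, "Sleep": 0.2},
    "Class2": {"Class1": 0.2, "Class2": 0.3, "Sleep": 0.5},
    "Sleep":  {"Sleep": 1.0},                  # absorbing terminal state
}

def sample_episode(start="Class1", max_steps=10):
    """Sample a state sequence using only the current state's row of P."""
    s, episode = start, [start]
    for _ in range(max_steps):
        if s == "Sleep":                       # stop at the terminal state
            break
        s = random.choices(list(P[s]), weights=list(P[s].values()))[0]
        episode.append(s)
    return episode

print(sample_episode())
```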
A Markov reward process is a Markov chain with values.
A Markov Reward Process is a tuple <S, P, R, γ>
S is a finite set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
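The "values" in an MRP enter through the discounted return Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... A minimal sketch of computing this for one sampled episode (the reward sequence below is illustrative, not from the slides):

```python
# Discounted return of an MRP episode:
# G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ...

def discounted_return(rewards, gamma=0.9):
    """Compute G_0 for a finite reward sequence [R_1, R_2, ...]."""
    g = 0.0
    for r in reversed(rewards):    # fold from the back: G = R + gamma * G'
        g = r + gamma * g
    return g

# e.g. rewards observed along one sampled episode
print(discounted_return([-2.0, -2.0, -2.0, 10.0]))
```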