Markov Decision Process (MDP)
Markov property: the future depends only on the current state, not on the earlier history.
Markov process / Markov chain: state transition matrix $P$ with entries $p(s_{t+1}=s'\mid s_t=s)$, the probability of moving from one state (node) to another.
Markov Reward Process (MRP): a Markov chain with a reward function added.
Horizon: the number of steps in each episode.
Return: the discounted sum of rewards, $G_t=r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\cdots$; discounting avoids infinite returns in cyclic processes and puts more weight on rewards obtained sooner.
Value function: the expected return starting from a state, $V(s)=\mathbb{E}[G_t\mid s_t=s]$.
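A minimal sketch (not from the notes) of computing a discounted return from one episode's reward sequence; `rewards` and `gamma` are illustrative names:

```python
# Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
# `rewards` holds the rewards collected along one episode, in order.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Iterate backwards so each step folds in the discount exactly once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```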
Markov Reward Process
Bellman equation: $V(s)=R(s)+\gamma\sum_{s'\in S}P(s'\mid s)V(s')$
Solving this linear system exactly, $V=(I-\gamma P)^{-1}R$, requires a matrix inversion of cost $O(|S|^3)$, so it is only practical for small state spaces.
Ways to compute the value function (sketched in the code below):
(1) Monte Carlo: sample trajectories and average their returns; each sampled trajectory yields one return used to estimate the corresponding $V(s)$.
(2) Dynamic programming: turn the Bellman equation into a Bellman update and iterate it until the values converge.
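A small sketch of these approaches for a tabular MRP, assuming the MRP is given as a transition matrix `P`, reward vector `R`, and discount `gamma` (all names illustrative): the closed-form solution, the iterative Bellman update, and a simple Monte Carlo estimate.

```python
import numpy as np

def mrp_value_exact(P, R, gamma):
    """Closed form V = (I - gamma*P)^{-1} R; O(|S|^3), small state spaces only."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

def mrp_value_bellman_update(P, R, gamma, tol=1e-8):
    """Iterate V_{t+1}(s) = R(s) + gamma * sum_{s'} P(s'|s) V_t(s') until convergence."""
    V = np.zeros_like(R, dtype=float)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def mrp_value_mc(P, R, gamma, episodes=500, horizon=100, seed=0):
    """Monte Carlo: from each start state, sample trajectories and average their returns."""
    rng = np.random.default_rng(seed)
    n = len(R)
    V = np.zeros(n)
    for s0 in range(n):
        total = 0.0
        for _ in range(episodes):
            s, g, discount = s0, 0.0, 1.0
            for _ in range(horizon):
                g += discount * R[s]          # reward received in the visited state
                discount *= gamma
                s = rng.choice(n, p=P[s])
            total += g
        V[s0] = total / episodes
    return V

# Tiny two-state example with illustrative numbers.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, 0.0])
print(mrp_value_exact(P, R, 0.9))
print(mrp_value_bellman_update(P, R, 0.9))
print(mrp_value_mc(P, R, 0.9))
```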
Markov Decision Process
Adds a decision (action) process on top of the reward process.
$P(s_{t+1}=s'\mid s_t=s, a_t=a)$, where $a_t$ is the action taken at the current step.
The corresponding policy:
$\pi(a\mid s)=P(a_t=a\mid s_t=s)$
Given an MDP and a policy $\pi$, the MDP can be converted into a Markov reward process: $R^\pi(s)=\sum_{a}\pi(a\mid s)R(s,a)$ and $P^\pi(s'\mid s)=\sum_{a}\pi(a\mid s)P(s'\mid s,a)$.
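A possible sketch of this conversion for a tabular MDP; the array layout (`P[a, s, s']`, `R[s, a]`, `pi[s, a]`) is an assumption for illustration:

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Fold a fixed policy pi into an MDP, producing an MRP.

    P[a, s, s'] : MDP transitions P(s'|s, a)
    R[s, a]     : reward for taking action a in state s
    pi[s, a]    : policy probabilities pi(a|s)

    Returns (P_pi, R_pi) with
      P_pi(s'|s) = sum_a pi(a|s) * P(s'|s, a)
      R_pi(s)    = sum_a pi(a|s) * R(s, a)
    """
    P_pi = np.einsum('sa,ast->st', pi, P)
    R_pi = (pi * R).sum(axis=1)
    return P_pi, R_pi
```

The resulting `P_pi` and `R_pi` can then be plugged into the MRP value computations sketched above.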
Compare MP/MRP & MDP
The transition from the current state to the next state now goes through a step controlled by the agent (it depends on the chosen policy).
This lets us define a value function for an MDP by taking an expectation over the policy (the random variable describing which action is taken at time $t$): $v^\pi(s)=\mathbb{E}_\pi[G_t\mid s_t=s]$.
Definition (action-value function): $q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid s_t=s, a_t=a]$
Relation: $v^{\pi}(s)=\sum_{a\in A}\pi(a\mid s)\,q^\pi(s,a)$
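A one-line check of this relation for tabular arrays; the layouts `pi[s, a]` and `q[s, a]` are illustrative assumptions:

```python
import numpy as np

# v^pi(s) = sum_a pi(a|s) * q^pi(s, a)
def v_from_q(pi, q):
    return (pi * q).sum(axis=1)
```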
Prediction & Control in MDP
Prediction: evaluate a given policy (compute its value function).
Control: search for the optimal policy.
Dynamic Programming
Prediction:
Given the policy function, the problem simplifies to a Markov reward process.
Use synchronous backups to compute $v^{\pi}(s)$ recursively: with the policy fixed, each iteration updates the value of every state from the previous iterate, and the sequence $v_t$ converges to $v^{\pi}(s)$:
$v_{t+1}(s)=R^\pi(s)+\gamma\sum_{s'\in S}P^\pi(s'\mid s)\,v_t(s')$
By default, the value function here depends only on the state.
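A short sketch of synchronous-backup policy evaluation, assuming the tabular layout used above (`P[a, s, s']`, `R[s, a]`, `pi[s, a]`):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Synchronous backup: v_{t+1}(s) = R^pi(s) + gamma * sum_{s'} P^pi(s'|s) v_t(s')."""
    P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi(s'|s)
    R_pi = (pi * R).sum(axis=1)             # R^pi(s)
    v = np.zeros(R_pi.shape[0])
    while True:
        v_new = R_pi + gamma * P_pi @ v      # update every state from the previous iterate
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```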
Optimal Value Function:
$v^*(s)=\max_{\pi} v^{\pi}(s)$
$\pi^*(s)=\arg\max_{\pi} v^{\pi}(s)$
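A sketch of extracting a greedy (deterministic) policy from a value function, with the same illustrative array layout:

```python
import numpy as np

def greedy_policy(P, R, v, gamma):
    """pi(s) = argmax_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s') ]."""
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q(s, a) implied by v
    return q.argmax(axis=1)                        # one action index per state
```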
Find optimal policy:
1. Policy search: exhaustively enumerate all policies (brute force).
2. MDP control: in the infinite-horizon setting the optimal policy is deterministic (and stationary).
Iterative procedure: evaluate the current policy $\pi$ (compute $v^\pi$), then improve it: $\pi'=\mathrm{greedy}(v^\pi)$.
That is, alternate between the policy and its value function.
Each improvement step is guaranteed not to make the policy worse: $v^{\pi'}(s)\ge v^{\pi}(s)$ for all $s$.
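A compact policy-iteration sketch under the same assumed layout; here the evaluation step is done exactly with a linear solve for brevity (iterative evaluation as above works equally well):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy stops changing.
    Shapes: P[a, s, s'], R[s, a]."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)               # deterministic policy: one action per state
    while True:
        # Evaluation: solve v = R^pi + gamma * P^pi v for the current policy.
        P_pi = P[pi, np.arange(n_states)]            # rows P^pi(.|s)
        R_pi = R[np.arange(n_states), pi]            # R^pi(s)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: pi'(s) = argmax_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s') ].
        pi_new = (R + gamma * np.einsum('ast,t->sa', P, v)).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```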
Bellman optimality equation: $v^*(s)=\max_a q^*(s,a)$
Value iteration: iterate the Bellman optimality equation to find the optimal policy, sweeping over every state (sketched below).
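A matching value-iteration sketch (same assumed layout), iterating the Bellman optimality backup and then reading off the greedy policy:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """v(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s') ], swept over every state."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q(s, a) under the current v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return q.argmax(axis=1), v_new             # greedy policy and optimal values
        v = v_new
```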