Markov Decision Processes

Non-deterministic search

the dynamics of the world add uncertainty to the outcome, making the agent's actions non-deterministic

Markov Decision Process

a model used to solve non-deterministic search problems
properties:

  1. a set of states $S$
  2. a set of actions $A$
  3. a start state
  4. possibly one or more terminal states
  5. a discount factor $\gamma$
  6. a transition function $T(s,a,s')$: a probability function
  7. a reward function $R(s,a,s')$: small rewards at each step, and a large reward in the terminal states (immediate vs. long-term rewards)

goal:
choose a sequence of actions that maximizes the cumulative reward
$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$

expected reward: $E(r_t\mid s_t,a_t)$
Q-state:
$Q(s,a)$: the node reached after committing to action $a$ in state $s$; its outgoing edges are weighted by transition probabilities
Note: a Q-state does not take up a time step
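As a concrete illustration, here is one way the components above could be laid out in Python. The states, actions, probabilities, and rewards below are hypothetical (a tiny example loosely modeled on the classic racing-car MDP), chosen only so the later sketches have something to run on.

```python
# Hypothetical MDP layout: states S, actions A, discount factor gamma,
# sparse transition function T(s,a,s') and reward function R(s,a,s').
states = ["cool", "warm", "done"]          # "done" plays the role of a terminal state
actions = ["slow", "fast"]
gamma = 0.9                                # discount factor

# T maps (s, a, s') -> probability; missing keys mean probability 0
T = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 0.5,
    ("cool", "fast", "warm"): 0.5,
    ("warm", "slow", "cool"): 0.5,
    ("warm", "slow", "warm"): 0.5,
    ("warm", "fast", "done"): 1.0,
}

# R maps (s, a, s') -> reward: small rewards per step, a large one at the end
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "done"): -10.0,
}
```

A sparse dict keeps $T$ and $R$ readable for tiny examples; real implementations typically use arrays indexed by state and action.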

objective:
maximize the sum of rewards

  1. Markov process: satisfies the Markov property (memoryless property)
    $T(s,a,s')=P(s'\mid s,a)$
    Markov reward model: $R(s_t=s)=E(r_t\mid s_t=s)$
    utility: $G_t=r_t+\gamma r_{t+1}+\cdots$
    value function: $V(s)=E(G_t\mid s_t=s)$
    horizon: the number of steps in the trajectory
    this model has no actions

finite horizon and discount factor

To keep the agent from endlessly taking a safe step each turn and accumulating unbounded reward:
finite horizon:
a nonstationary policy ($\pi$ depends on the time left)

additive utility:
$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$

discounted utility:
$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+\gamma R(s_1,a_1,s_2)+\cdots$
convergence: $U\leq\frac{R_{max}}{1-\gamma}$
a small $\gamma$ corresponds to a short effective horizon
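As a quick sanity check of this bound, with hypothetical values $R_{max}=1$ and $\gamma=0.9$ (not tied to any particular problem), the discounted sum is a geometric series:

```latex
U = \sum_{t=0}^{\infty} \gamma^{t} R(s_t,a_t,s_{t+1})
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{max}
  = \frac{R_{max}}{1-\gamma}
  = \frac{1}{1-0.9}
  = 10
```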

Markovianess

Markov property or memoryless property: the past and the future are conditionally independent given the present
$P(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\cdots)=P(s_{t+1}\mid s_t,a_t)$

solving Markov Decision Processes

solution: a policy $\pi^*$ with $\pi^*(s)=a$ that maximizes the expected utility (total reward)

the Bellman equation

  1. the optimal values
    the optimal value of a state $s$, $V^*(s)$: the expected utility of starting in $s$ and acting optimally
    the optimal value of a Q-state, $Q^*(s,a)$: the expected utility when the agent starts in $s$, takes action $a$, and acts optimally afterwards
  2. the Bellman equation:
    $V^*(s)=\max_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$
    a type of dynamic programming equation:
    an equation that decomposes a problem into smaller subproblems via an inherent recursive structure
    the Bellman equation serves as a condition for optimality: if the Bellman equation holds for all $V(s)$, then these values are exactly $V^*(s)$
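Equivalently, the same condition can be split through the two optimal values defined above; this is a standard restatement, not specific to this note:

```latex
V^*(s) = \max_{a} Q^*(s,a),
\qquad
Q^*(s,a) = \sum_{s'} T(s,a,s')\bigl[\,R(s,a,s') + \gamma V^*(s')\,\bigr]
```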

value iteration

time-limited values: $V_k(s)$, the optimal value of $s$ if the game ends in $k$ more time steps

value iteration is a dynamic programming algorithm:
$\forall s\in S:\ V_{k+1}(s)\leftarrow\max_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V_k(s')\right]$, starting from $V_0(s)=0$
each iteration has complexity $O(|S|^2|A|)$,
since a single action may lead to any state
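A minimal sketch of this backup in Python, reusing the hypothetical dict-based `states`/`actions`/`T`/`R`/`gamma` layout from earlier (illustrative only):

```python
def value_iteration(states, actions, T, R, gamma, iters=100):
    """Value iteration sketch on the dict-based MDP laid out earlier: T and R
    map (s, a, s') to a probability / reward, and missing keys count as
    probability 0 (so terminal states simply keep value 0)."""
    V = {s: 0.0 for s in states}                 # V_0(s) = 0 for every state
    for _ in range(iters):
        # Synchronous Bellman backup:
        # V_{k+1}(s) = max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
        V = {s: max(sum(T.get((s, a, s2), 0.0) *
                        (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
             for s in states}
    return V
```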

convergence:

  • case 1: if the search tree has maximum depth $M$, then $V_M$ holds the actual untruncated values
  • case 2: if the discount factor satisfies $0\leq\gamma<1$, the approximations converge: $V_k$ and $V_{k+1}$ differ only in rewards that lie at least $k$ steps in the future, and that tail is bounded by $\gamma^k\frac{R_{max}}{1-\gamma}$, which goes to $0$ as $k\to\infty$

policy extraction

$\forall s\in S,\ \pi^*(s)=\arg\max_a Q^*(s,a)=\arg\max_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$
storing the Q-values avoids recomputing the expectations during extraction
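A sketch of policy extraction in the same style, assuming the `V` returned by `value_iteration` above:

```python
def extract_policy(states, actions, T, R, gamma, V):
    """Policy extraction sketch:
    pi(s) = argmax_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * V(s') ]."""
    def q_value(s, a):
        # One-step lookahead: the Q-value of (s, a) under the value estimates V
        return sum(T.get((s, a, s2), 0.0) *
                   (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)
    return {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
```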

policy iteration

  • define an initial policy
  • policy evaluation: solve the linear system directly, $O(n^3)$, or evaluate iteratively, $O(|S|^2)$ per sweep
  • policy improvement: based on the policy evaluation, generate a new greedy policy (see the sketch after this list)
    $\pi_{new}(s)=\arg\max_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^{\pi}(s')\right]$
    a dynamic-programming-based approach
    if we compute $V$ iteratively and update the policy after only a single evaluation sweep, this is equivalent to value iteration
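A sketch of the full loop on the same hypothetical dict-based MDP, reusing `extract_policy` for the improvement step; policy evaluation is done iteratively here, with the direct linear solve being the $O(n^3)$ alternative mentioned above:

```python
def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    """Policy iteration sketch: alternate iterative policy evaluation with
    greedy policy improvement until the policy stops changing."""
    policy = {s: actions[0] for s in states}     # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterative sweeps of
        # V(s) = sum_s' T(s, pi(s), s') [ R(s, pi(s), s') + gamma * V(s') ]
        for _ in range(eval_sweeps):
            V = {s: sum(T.get((s, policy[s], s2), 0.0) *
                        (R.get((s, policy[s], s2), 0.0) + gamma * V[s2])
                        for s2 in states)
                 for s in states}
        # Policy improvement: act greedily with respect to the evaluated V
        new_policy = extract_policy(states, actions, T, R, gamma, V)
        if new_policy == policy:                 # policy stable -> stop
            return policy, V
        policy = new_policy
```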

asynchronous DP

  • for each selected state, apply the appropriate backup
  • can significantly reduce the computation
  • guaranteed to converge if all states continue to be selected
    three simple ideas for asynchronous updates (the last two are sketched in code after this list):
  • in-place dynamic programming
  • prioritized sweeping
    use the magnitude of the Bellman error to guide state selection:
    $\left|\max_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V(s')\right]-V(s)\right|$
    after each backup, update the Bellman error of the affected states
    can be implemented with a priority queue
  • real-time dynamic programming
    only update states that are relevant to the agent
    use the agent's experience to guide the selection of states
    after each time-step, back up the state the agent actually visited:
    $V(s_t)\leftarrow\max_a\sum_{s'}T(s_t,a,s')\left[R(s_t,a,s')+\gamma V(s')\right]$
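Below are minimal sketches of the last two ideas, again on the hypothetical dict-based MDP used throughout; the stopping rules and sweep counts are arbitrary choices for illustration, not part of the algorithms themselves.

```python
import heapq

def prioritized_sweeping(states, actions, T, R, gamma, backups=1000, theta=1e-5):
    """Prioritized sweeping sketch: keep states in a priority queue ordered by
    the magnitude of their Bellman error, always back up the highest-error
    state, then re-score its predecessors."""
    def backup(s, V):
        # max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma * V(s') ]
        return max(sum(T.get((s, a, s2), 0.0) *
                       (R.get((s, a, s2), 0.0) + gamma * V[s2])
                       for s2 in states)
                   for a in actions)

    # Predecessor sets: which states can reach s' with probability > 0
    preds = {s: set() for s in states}
    for (s, a, s2), p in T.items():
        if p > 0:
            preds[s2].add(s)

    V = {s: 0.0 for s in states}
    # Max-heap via negated errors; seed it with every state's Bellman error
    pq = [(-abs(backup(s, V) - V[s]), s) for s in states]
    heapq.heapify(pq)

    for _ in range(backups):
        if not pq:
            break
        _, s = heapq.heappop(pq)                 # largest (possibly stale) error
        V[s] = backup(s, V)                      # apply the Bellman backup to it
        for p in preds[s]:                       # predecessors' errors may have changed
            err = abs(backup(p, V) - V[p])
            if err > theta:
                heapq.heappush(pq, (-err, p))
    return V
```

And a real-time DP sketch, where the agent's own (simulated) experience selects which state gets backed up at each time-step:

```python
import random

def real_time_dp(states, actions, T, R, gamma, start, episodes=100, max_steps=50):
    """Real-time DP sketch: act greedily with respect to the current value
    estimates and, after each time-step, back up only the state actually visited."""
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        return sum(T.get((s, a, s2), 0.0) *
                   (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)

    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            qs = {a: q_value(s, a) for a in actions}
            V[s] = max(qs.values())              # back up the visited state
            a = max(qs, key=qs.get)              # greedy action
            # Sample the successor from T(s, a, .); stop if s has no successors
            succs = [(s2, p) for (ss, aa, s2), p in T.items() if ss == s and aa == a]
            if not succs:
                break
            s = random.choices([s2 for s2, _ in succs],
                               weights=[p for _, p in succs], k=1)[0]
    return V
```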

exercises

When solving problems by hand, remember that policy evaluation means solving the system of equations, not iterating.
Pay attention to how policy evaluation handles the value of terminal states.
