Non-deterministic search
The dynamics of the world add uncertainty to the outcome of actions, making the agent's actions nondeterministic.
Markov Decision Process
a model for solving non-deterministic search problems
properties:
- a set of states S
- a set of actions A
- a start state
- possibly one or more terminal states
- a discount factor $\gamma$
- a transition function $T(s,a,s')$: a probability function
- a reward function $R(s,a,s')$: typically a small reward at each step and a large reward in the terminal states (immediate rewards vs. long-term rewards)
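As a concrete illustration, here is a minimal Python sketch of these components; the two-state "racing" MDP below, with all its states, actions, probabilities, and rewards, is invented purely for illustration.

```python
# A minimal sketch of the MDP components above, using a made-up
# two-state "racing" MDP.

states = ["cool", "warm"]            # set of states S
actions = ["slow", "fast"]           # set of actions A
start_state = "cool"                 # start state
gamma = 0.9                          # discount factor

# transition function T(s, a, s'): T[(s, a)] maps s' -> P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# reward function R(s, a, s')
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```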
goal:
choose a sequence of actions that maximizes the cumulative reward
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$$
The expected immediate reward at time $t$ is $E(r_t \mid s_t, a_t)$.
q-state:
$q(s,a)$: the node reached after committing to action $a$ in state $s$; edges from a q-state to successor states are weighted by transition probabilities
Note: a q-state does not consume a time step.
objective:
maximize the sum of rewards
- Markov process: satisfies the Markov property (memoryless property): $T(s,a,s')=P(s'\mid s,a)$
Markov reward model: $R(s_t=s)=E(r_t \mid s_t=s)$
utility (return): $G_t=r_t+\gamma r_{t+1}+\cdots$
value function: $V(s)=E(G_t \mid s_t=s)$
horizon: the number of steps in the trajectory
Note: this model has no actions.
finite horizon and discount factor
Both are introduced to prevent the agent from taking a safe step every time and accumulating reward without limit.
finite horizon:
leads to a nonstationary policy ($\pi$ depends on the time left)
additive utility:
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$$
discounted utility:
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+\gamma R(s_1,a_1,s_2)+\cdots$$
Convergence: since every reward is at most $R_{max}$, the geometric series bounds the utility:
$$U\leq\frac{R_{max}}{1-\gamma}$$
A small $\gamma$ implies a small effective horizon.
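A quick numerical check of this bound, using a hypothetical reward stream that always pays $R_{max}=1$ at every step:

```python
# Numerically checking the R_max / (1 - gamma) bound with a
# hypothetical constant reward stream.

def discounted_return(rewards, gamma):
    """Discounted utility: sum of gamma**t * r_t over the sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 1000                 # always receiving R_max = 1
u = discounted_return(rewards, gamma)
bound = 1.0 / (1 - gamma)              # R_max / (1 - gamma) = 10
print(u <= bound, u, bound)            # True; u approaches but never exceeds 10
```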
Markovianness
Markov property or memoryless property: the past and the future are conditionally independent given the present:
$$P(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\cdots)=P(s_{t+1}\mid s_t,a_t)$$
solving Markov Decision Processes
solution: a policy $\pi^*(s)=a$ that maximizes the expected utility (total reward)
the Bellman equation
the optimal value of a state $s$, $V^*(s)$: the expected utility when starting in $s$ and acting optimally
the optimal value of a q-state, $Q^*(s,a)$: the expected utility when starting in $s$, taking action $a$, and acting optimally thereafter
the Bellman equation is a type of dynamic programming equation: an equation that decomposes a problem into smaller subproblems via an inherent recursive structure
The Bellman equation serves as a condition for optimality: if it holds for all $v(s)$, then those $v(s)$ are exactly the optimal values $v^*(s)$.
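For reference, the standard Bellman optimality equations take the following form, with $T$, $R$, and $\gamma$ as defined above:
$$V^*(s)=\max_{a}\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$$
$$Q^*(s,a)=\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma \max_{a'}Q^*(s',a')\right]$$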
value iteration
time-limited values: $V_k(s)$, the optimal value of $s$ assuming the game ends in $k$ time steps
value iteration is a dynamic programming algorithm
each iteration has complexity $O(S^2A)$, because a single action may lead to any state
convergence:
- case 1: if the tree has maximum depth $M$, then $V_M$ holds the actual untruncated values
- case 2: if $\gamma<1$, the values converge as $k\to\infty$, since rewards more than $k$ steps away are discounted by $\gamma^k$ and contribute vanishingly little (see the sketch below)
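A minimal value iteration sketch in Python, repeatedly applying the update $V_{k+1}(s)=\max_a\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')]$; the toy two-state MDP below is made up for illustration.

```python
# Minimal value iteration on a made-up two-state MDP.

states = ["A", "B"]
actions = ["stay", "go"]
gamma = 0.9
T = {  # T[(s, a)] maps s' -> P(s' | s, a)
    ("A", "stay"): {"A": 1.0},
    ("A", "go"):   {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"):   {"A": 0.8, "B": 0.2},
}
R = {  # R[(s, a, s')]
    ("A", "stay", "A"): 0.0,
    ("A", "go", "A"): 0.0, ("A", "go", "B"): 1.0,
    ("B", "stay", "B"): 2.0,
    ("B", "go", "A"): 0.0, ("B", "go", "B"): 0.0,
}

V = {s: 0.0 for s in states}                       # V_0(s) = 0
for k in range(100):                               # each sweep is O(S^2 A)
    V = {
        s: max(
            sum(p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items())
            for a in actions
        )
        for s in states
    }
print(V)
```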
policy extraction
$$\forall s\in S,\quad\pi^*(s)=\operatorname{argmax}_a Q^*(s,a)=\operatorname{argmax}_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$$
Caching the Q-values avoids recomputing the expectation.
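A small sketch of policy extraction as a Python function, assuming the dictionary-based T, R, gamma, and a converged V like those in the value iteration sketch above:

```python
# Policy extraction: one-step lookahead argmax over actions,
# assuming the dictionary MDP layout from the sketches above.

def extract_policy(states, actions, T, R, gamma, V):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return {
        s: max(
            actions,
            key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items()
            ),
        )
        for s in states
    }
```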
policy iteration
- define an initial policy
- policy evaluation: solve the linear system ($O(n^3)$) or iterate ($O(S^2)$ per sweep)
- policy improvement: use the evaluated values to generate a new greedy policy
dynamic-programming-based approaches
If we compute $v$ by iteration but update the policy after only one sweep, this is the same as value iteration.
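A compact policy iteration sketch under the same assumed dictionary layout; evaluation here uses iterative sweeps (the $O(S^2)$-per-sweep route) rather than the exact linear solve:

```python
# Policy iteration: evaluate the current policy, then improve it
# greedily, until the policy is stable.

def policy_iteration(states, actions, T, R, gamma, sweeps=100):
    pi = {s: actions[0] for s in states}            # 1. initial policy
    while True:
        # 2. policy evaluation: fixed-policy backups until (roughly) converged
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V = {
                s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])].items())
                for s in states
            }
        # 3. policy improvement: act greedily with one-step lookahead
        new_pi = {
            s: max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_pi == pi:                            # stable -> done
            return pi, V
        pi = new_pi
```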
asynchronous DP
- for each selected state, apply the appropriate backup
- can significantly reduce the computation
- guaranteed to converge if all states continue to be selected
3 simple ideas for asynchronous updates:
- in-place dynamic programming
- prioritized sweeping
  - use the magnitude of the Bellman error to guide state selection
  - recompute the Bellman errors of affected states after each update
  - can be implemented with a priority queue (see the sketch after this list)
- real-time dynamic programming
  - back up only the states that are relevant to the agent
  - use the agent's experience to guide the selection of states
  - update after each time step
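A rough prioritized-sweeping sketch using Python's heapq as a max-priority queue (via negated priorities), again assuming the dictionary MDP layout from the earlier sketches; stale heap entries are tolerated rather than removed:

```python
# Prioritized sweeping: back up states in order of Bellman error.

import heapq

def backup(s, V, actions, T, R, gamma):
    """One Bellman backup for state s."""
    return max(
        sum(p * (R[(s, a, s2)] + gamma * V[s2])
            for s2, p in T[(s, a)].items())
        for a in actions
    )

def prioritized_sweeping(states, actions, T, R, gamma, iters=1000, tol=1e-8):
    V = {s: 0.0 for s in states}
    # seed the queue with each state's initial Bellman error
    heap = [(-abs(backup(s, V, actions, T, R, gamma) - V[s]), s) for s in states]
    heapq.heapify(heap)
    for _ in range(iters):
        if not heap:
            break
        _, s = heapq.heappop(heap)                 # largest Bellman error first
        V[s] = backup(s, V, actions, T, R, gamma)
        # predecessors of s may now have stale values: recompute their errors
        for (sp, a), dist in T.items():
            if s in dist:
                err = abs(backup(sp, V, actions, T, R, gamma) - V[sp])
                if err > tol:
                    heapq.heappush(heap, (-err, sp))
    return V
```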
Exercises
Note that when solving exercises by hand, policy evaluation means solving the system of equations, not iterating.
Also pay attention to how policy evaluation assigns values at terminal states.
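A sketch of the exact-solve version of policy evaluation mentioned in the tip above, $V^\pi=(I-\gamma T_\pi)^{-1}R_\pi$; the 2-state numbers are made up, with state 1 absorbing and zero-reward, standing in for a terminal state:

```python
# Exact policy evaluation as a linear solve.

import numpy as np

gamma = 0.9
# T_pi[i, j] = P(s_j | s_i, pi(s_i)) under the fixed policy pi
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
# R_pi[i] = expected immediate reward from s_i under pi
R_pi = np.array([1.0, 0.0])

V = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(V)   # the terminal-like state gets value 0, as expected
```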