Non-deterministic search
The dynamics of the world add uncertainty to the outcome of actions, making the agent's actions nondeterministic.
Markov Decision Process
a model for solving non-deterministic search problems
properties:
- a set of states S
- a set of actions A
- a start state
- possibly one or more terminal states
- a discount factor $\gamma$
- a transition function $T(s,a,s')$: a probability function
- a reward function $R(s,a,s')$: typically a small reward at each step and a large reward in the terminal states (immediate rewards vs. long-term rewards)
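As a concrete illustration, here is a minimal Python sketch of these components; the two-state "racing" MDP below, with all its states, actions, probabilities, and rewards, is invented purely for illustration.

```python
# A minimal sketch of the MDP components above, using a made-up
# two-state "racing" MDP.

states = ["cool", "warm"]            # set of states S
actions = ["slow", "fast"]           # set of actions A
start_state = "cool"                 # start state
gamma = 0.9                          # discount factor

# transition function T(s, a, s'): T[(s, a)] maps s' -> P(s' | s, a)
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# reward function R(s, a, s')
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}
```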
goal:
choose a sequence of actions that maximizes the cumulative reward
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$$
The expected immediate reward at time $t$ is $E(r_t \mid s_t, a_t)$.
q-state:
$q(s,a)$: the node reached after committing to action $a$ in state $s$; edges from a q-state to successor states are weighted by transition probabilities
Note: a q-state does not consume a time step.
objective:
maximize the sum of rewards
- Markov process: satisfies the Markov property (memoryless property): $T(s,a,s')=P(s'\mid s,a)$
Markov reward model: $R(s_t=s)=E(r_t \mid s_t=s)$
utility (return): $G_t=r_t+\gamma r_{t+1}+\cdots$
value function: $V(s)=E(G_t \mid s_t=s)$
horizon: the number of steps in the trajectory
Note: this model has no actions.
finite horizon and discount factor
Both are introduced to prevent the agent from taking a safe step every time and accumulating reward without limit.
finite horizon:
leads to a nonstationary policy ($\pi$ depends on the time left)
additive utility:
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+R(s_1,a_1,s_2)+\cdots$$
discounted utility:
$$U(s_0,a_0,s_1,a_1,\cdots)=R(s_0,a_0,s_1)+\gamma R(s_1,a_1,s_2)+\cdots$$
Convergence: since every reward is at most $R_{max}$, the geometric series bounds the utility:
$$U\leq\frac{R_{max}}{1-\gamma}$$
A small $\gamma$ implies a small effective horizon.
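A quick numerical check of this bound, using a hypothetical reward stream that always pays $R_{max}=1$ at every step:

```python
# Numerically checking the R_max / (1 - gamma) bound with a
# hypothetical constant reward stream.

def discounted_return(rewards, gamma):
    """Discounted utility: sum of gamma**t * r_t over the sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 1000                 # always receiving R_max = 1
u = discounted_return(rewards, gamma)
bound = 1.0 / (1 - gamma)              # R_max / (1 - gamma) = 10
print(u <= bound, u, bound)            # True; u approaches but never exceeds 10
```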
Markovianness
Markov property or memoryless property: the past and the future are conditionally independent given the present:
$$P(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\cdots)=P(s_{t+1}\mid s_t,a_t)$$
solving Markov Decision Processes
solution: a policy $\pi^*(s)=a$ that maximizes the expected utility (total reward)
the Bellman equation
the optimal value of a state $s$, $V^*(s)$: the expected utility when starting in $s$ and acting optimally
the optimal value of a q-state, $Q^*(s,a)$: the expected utility when starting in $s$, taking action $a$, and acting optimally thereafter
the Bellman equation is a type of dynamic programming equation: an equation that decomposes a problem into smaller subproblems via an inherent recursive structure
The Bellman equation serves as a condition for optimality: if it holds for all $v(s)$, then those $v(s)$ are exactly the optimal values $v^*(s)$.
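For reference, the standard Bellman optimality equations take the following form, with $T$, $R$, and $\gamma$ as defined above:
$$V^*(s)=\max_{a}\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$$
$$Q^*(s,a)=\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma \max_{a'}Q^*(s',a')\right]$$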
value iteration
time-limited values: $V_k(s)$, the optimal value of $s$ assuming the game ends in $k$ time steps
value iteration is a dynamic programming algorithm
each iteration has complexity $O(S^2A)$, because a single action may lead to any state
convergence:
- case 1: if the tree has maximum depth $M$, then $V_M$ holds the actual untruncated values
- case 2: if $\gamma<1$, the values converge as $k\to\infty$, since rewards more than $k$ steps away are discounted by $\gamma^k$ and contribute vanishingly little (see the sketch below)
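A minimal value iteration sketch in Python, repeatedly applying the update $V_{k+1}(s)=\max_a\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')]$; the toy two-state MDP below is made up for illustration.

```python
# Minimal value iteration on a made-up two-state MDP.

states = ["A", "B"]
actions = ["stay", "go"]
gamma = 0.9
T = {  # T[(s, a)] maps s' -> P(s' | s, a)
    ("A", "stay"): {"A": 1.0},
    ("A", "go"):   {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"):   {"A": 0.8, "B": 0.2},
}
R = {  # R[(s, a, s')]
    ("A", "stay", "A"): 0.0,
    ("A", "go", "A"): 0.0, ("A", "go", "B"): 1.0,
    ("B", "stay", "B"): 2.0,
    ("B", "go", "A"): 0.0, ("B", "go", "B"): 0.0,
}

V = {s: 0.0 for s in states}                       # V_0(s) = 0
for k in range(100):                               # each sweep is O(S^2 A)
    V = {
        s: max(
            sum(p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items())
            for a in actions
        )
        for s in states
    }
print(V)
```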
policy extraction
$$\forall s\in S,\quad\pi^*(s)=\operatorname{argmax}_a Q^*(s,a)=\operatorname{argmax}_a\sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^*(s')\right]$$
Caching the Q-values avoids recomputing the expectation.
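A small sketch of policy extraction as a Python function, assuming the dictionary-based T, R, gamma, and a converged V like those in the value iteration sketch above:

```python
# Policy extraction: one-step lookahead argmax over actions,
# assuming the dictionary MDP layout from the sketches above.

def extract_policy(states, actions, T, R, gamma, V):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return {
        s: max(
            actions,
            key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items()
            ),
        )
        for s in states
    }
```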
policy iteration
- define an initial policy
- policy evaluation: solve the linear system ($O(n^3)$) or iterate ($O(S^2)$ per sweep)
- policy improvement: use the evaluated values to generate a new greedy policy
dynamic-programming-based approaches
If we compute $v$ by iteration but update the policy after only one sweep, this is the same as value iteration.
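A compact policy iteration sketch under the same assumed dictionary layout; evaluation here uses iterative sweeps (the $O(S^2)$-per-sweep route) rather than the exact linear solve:

```python
# Policy iteration: evaluate the current policy, then improve it
# greedily, until the policy is stable.

def policy_iteration(states, actions, T, R, gamma, sweeps=100):
    pi = {s: actions[0] for s in states}            # 1. initial policy
    while True:
        # 2. policy evaluation: fixed-policy backups until (roughly) converged
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V = {
                s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])].items())
                for s in states
            }
        # 3. policy improvement: act greedily with one-step lookahead
        new_pi = {
            s: max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_pi == pi:                            # stable -> done
            return pi, V
        pi = new_pi
```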
asynchronous DP
- for each selected state, apply the appropriate backup
- can significantly reduce the computation
- guaranteed to converge if all states continue to be selected
3 simple ideas for asynchronous updates:
- in-place dynamic programming
- prioritized sweeping
  - use the magnitude of the Bellman error to guide state selection
  - recompute the Bellman errors of affected states after each update
  - can be implemented with a priority queue (see the sketch after this list)
- real-time dynamic programming
  - back up only the states that are relevant to the agent
  - use the agent's experience to guide the selection of states
  - update after each time step
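A rough prioritized-sweeping sketch using Python's heapq as a max-priority queue (via negated priorities), again assuming the dictionary MDP layout from the earlier sketches; stale heap entries are tolerated rather than removed:

```python
# Prioritized sweeping: back up states in order of Bellman error.

import heapq

def backup(s, V, actions, T, R, gamma):
    """One Bellman backup for state s."""
    return max(
        sum(p * (R[(s, a, s2)] + gamma * V[s2])
            for s2, p in T[(s, a)].items())
        for a in actions
    )

def prioritized_sweeping(states, actions, T, R, gamma, iters=1000, tol=1e-8):
    V = {s: 0.0 for s in states}
    # seed the queue with each state's initial Bellman error
    heap = [(-abs(backup(s, V, actions, T, R, gamma) - V[s]), s) for s in states]
    heapq.heapify(heap)
    for _ in range(iters):
        if not heap:
            break
        _, s = heapq.heappop(heap)                 # largest Bellman error first
        V[s] = backup(s, V, actions, T, R, gamma)
        # predecessors of s may now have stale values: recompute their errors
        for (sp, a), dist in T.items():
            if s in dist:
                err = abs(backup(sp, V, actions, T, R, gamma) - V[sp])
                if err > tol:
                    heapq.heappush(heap, (-err, sp))
    return V
```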
Exercises
Note that when solving exercises by hand, policy evaluation means solving the system of equations, not iterating.
Also pay attention to how policy evaluation assigns values at terminal states.
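A sketch of the exact-solve version of policy evaluation mentioned in the tip above, $V^\pi=(I-\gamma T_\pi)^{-1}R_\pi$; the 2-state numbers are made up, with state 1 absorbing and zero-reward, standing in for a terminal state:

```python
# Exact policy evaluation as a linear solve.

import numpy as np

gamma = 0.9
# T_pi[i, j] = P(s_j | s_i, pi(s_i)) under the fixed policy pi
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
# R_pi[i] = expected immediate reward from s_i under pi
R_pi = np.array([1.0, 0.0])

V = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(V)   # the terminal-like state gets value 0, as expected
```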