The following is only my personal understanding; the original book is in English. Corrections and discussion are welcome.
Terminology
policy: a mapping from each state to the probability of taking each action, written $\pi(a|s)$.
state-value: the expected return from the current state until the terminal state (if there is one), written $v_{\pi}(s)$.
action-value: the expected return from the current state, given that action $a$ is taken, until the terminal state (if there is one), written $q_{\pi}(s,a)$.
Markov Decision Process
As long as the environment's next state depends only on the current state (and action), not on the earlier history, the environment satisfies the Markov property.
If the action, state, and reward sets are all finite, it is further defined as a finite MDP.
If the environment is a finite MDP, then for a given policy the state value function and action value function are defined, in iterative form, as follows:
$$v_{k+1}(s) = E\big[R_{t+1} + \gamma v_{k}(S_{t+1}) \mid S_t = s\big] = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\big(r + \gamma v_{k}(s')\big)$$
$$q(s,a) = E\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s',r} p(s',r|s,a)\big(r + \gamma v_k(s')\big)$$
The difference between the two: the state value function estimates the expected return averaged over all actions the policy may take in the current state, while the action value function is the expected return when one particular action is taken in the current state.
In both formulas, the policy is the variable being evaluated.
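As a concrete illustration of one backup, here is a minimal sketch with made-up numbers: two actions, an equiprobable policy, and deterministic transitions so that $p(s',r|s,a)=1$. All names and values are hypothetical.

```python
gamma = 0.9
pi = {"left": 0.5, "right": 0.5}   # pi(a|s), assumed equiprobable
transitions = {                    # (reward r, v_k(s')) per action, assumed
    "left":  (-1.0, 0.0),          # r = -1, v_k(s'_left)  =  0.0
    "right": (-1.0, -2.0),         # r = -1, v_k(s'_right) = -2.0
}

# v_{k+1}(s) = sum_a pi(a|s) * (r + gamma * v_k(s'))
v_next = sum(pi[a] * (r + gamma * v) for a, (r, v) in transitions.items())
print(v_next)  # 0.5*(-1 + 0.9*0) + 0.5*(-1 + 0.9*(-2)) = -1.9
```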
Solving a finite MDP with DP
Both the state value function and the action value function are stored as tables.
The meaning of "in-place" in the book
When sweeping through the table, updating each state requires all of its successor states, and some successors may already have been updated earlier in the current sweep. If the already-updated values are used instead of the values from the previous sweep, the method is in-place. In-place methods usually speed up convergence.
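A minimal sketch of the difference, assuming a hypothetical `backup(values, s)` helper that computes the right-hand side of the update for state `s`:

```python
# Two-array (synchronous) sweep: every backup reads only last sweep's values.
def sweep_two_array(state_value, states, backup):
    new_value = state_value.copy()
    for s in states:
        new_value[s] = backup(state_value, s)    # reads the old table only
    return new_value

# In-place sweep: backups later in the sweep see already-updated values.
def sweep_in_place(state_value, states, backup):
    for s in states:
        state_value[s] = backup(state_value, s)  # reads the partially updated table
    return state_value
```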
policy evaluation
Compute the state values of a policy:
1. Initialize the table.
2. For each state, compute $\pi(a|s)$.
3. Using the state value update defined above, update every state value, recording the change at each state and tracking the maximum change across all states.
4. If the maximum change is smaller than some threshold, stop the algorithm; otherwise go to step 2.
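For a stochastic policy the sum over actions is explicit. A minimal sketch of this loop (hypothetical names; `pi[s][a]` is the policy table, and `step(s, a)` returns the deterministic successor and reward, as in the full program at the end of this post):

```python
def policy_evaluation(pi, states, actions, step, gamma=0.9, theta=1e-4):
    """In-place iterative policy evaluation; returns v as a dict."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) * (r + gamma * v(s'))  (deterministic transitions)
            new_v = 0.0
            for a in actions:
                s2, r = step(s, a)
                new_v += pi[s][a] * (r + gamma * v[s2])
            delta = max(delta, abs(v[s] - new_v))
            v[s] = new_v
        if delta < theta:
            return v
```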
policy improvement
For the policy whose evaluation has converged as above: if in some state there is an action with $q(s,a) \ge v(s)$, the policy can be updated to take that action. Repeatedly acting greedily in this way always produces a policy that is at least as good as the previous one:
$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s) \;\Rightarrow\; v_{\pi'}(s) \ge v_{\pi}(s)$$
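A minimal sketch of one greedy improvement sweep over a tabular policy (hypothetical names; `q_of(s, a)` computes $r + \gamma v(s')$ from the converged state values, as in the full program below):

```python
def policy_improvement(pi_table, states, actions, q_of):
    """Greedy improvement; returns True if the policy did not change."""
    stable = True
    for s in states:
        old_action = pi_table[s]
        best = max(actions, key=lambda a: q_of(s, a))  # pi'(s) = argmax_a q(s, a)
        if q_of(s, best) > q_of(s, old_action):
            pi_table[s] = best
            stable = False
    return stable
```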
policy iteration
If the environment is a finite MDP, policy evaluation and policy improvement can be alternated to obtain an optimal policy:
1. Init the tables.
2. Do policy evaluation.
3. Do policy improvement: sweep through the state value table; if $\arg\max_a q(s,a)$ is not equal to $\pi(s)$ for some state, go to step 2; otherwise done.
Below is the code for a 5×5 grid with (0,0) as the terminal state:
```python
import numpy as np

grid_width, grid_height = 5, 5
# up, down, left, right
action_set = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])
policy = np.zeros((grid_height, grid_width), dtype=np.int8)
gamma = 0.9

def step(state, action):
    # (0, 0) is the terminal state: stay there with reward 0
    if state[0] == 0 and state[1] == 0:
        return state, 0
    newstate = state + action
    # moves off the grid leave the state unchanged; every step costs -1
    if newstate[0] < 0 or newstate[0] >= grid_height or newstate[1] < 0 or newstate[1] >= grid_width:
        return state, -1
    else:
        return newstate, -1

# init state-value function
state_value = np.zeros((grid_height, grid_width))
policy_evaluation_iteration = 0
while True:
    # policy evaluation (in-place sweeps over a deterministic policy)
    while True:
        delta = 0
        for x in range(grid_height):
            for y in range(grid_width):
                next_state, reward = step(np.array([x, y]), action_set[policy[x][y]])
                value = reward + gamma * state_value[next_state[0]][next_state[1]]
                delta = max(delta, abs(state_value[x][y] - value))
                state_value[x][y] = value
        policy_evaluation_iteration += 1
        if delta < 1e-4:
            break
    # policy improvement: act greedily with respect to the converged values
    policy_stable = True
    for x in range(grid_height):
        for y in range(grid_width):
            next_state, reward = step(np.array([x, y]), action_set[policy[x][y]])
            max_action_value = reward + gamma * state_value[next_state[0]][next_state[1]]
            optimal_action = policy[x][y]
            for action_index in range(4):
                next_state, reward = step(np.array([x, y]), action_set[action_index])
                action_value = reward + gamma * state_value[next_state[0]][next_state[1]]
                if action_value > max_action_value:
                    max_action_value = action_value
                    optimal_action = action_index
            if optimal_action != policy[x][y]:
                policy_stable = False
                policy[x][y] = optimal_action
    if policy_stable:
        break

print(state_value)
print(policy)
print(policy_evaluation_iteration)
```
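With this reward structure (-1 per step, $\gamma = 0.9$), the converged value of a state at Manhattan distance $n$ from the corner should come out to roughly $-(1-\gamma^{n})/(1-\gamma)$ (up to the $10^{-4}$ stopping threshold), and the printed policy should point along a shortest path toward (0,0).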