References
[1] Introduction to Reinforcement Learning, Lecture 2: Model-Based Dynamic Programming Methods
Classification of Reinforcement Learning
The classification is shown in the figure above. The key distinction is that model-based dynamic programming methods know the transition probability $P$, the immediate reward function $R$, and the discount factor $\gamma$, whereas model-free reinforcement learning methods do not.
Understanding Dynamic Programming
"Dynamic" refers to the change of states; "programming" refers to an optimization method. A problem can be solved by dynamic programming when it satisfies two conditions:
- it can be decomposed into multiple subproblems
- the solutions to the subproblems can be stored and reused
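As a toy illustration of these two conditions (not from [1]), consider memoized Fibonacci: the problem decomposes into overlapping subproblems, and a cache stores each subproblem's solution for reuse.

```python
from functools import lru_cache

# fib(n) decomposes into fib(n-1) and fib(n-2)  -> condition 1
# lru_cache stores each subproblem's result for reuse -> condition 2
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # prints 832040; each subproblem is solved exactly once
```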
$$v^*(s) = \max_{a}\left( R^a_{s} + \gamma\sum_{s' \in S} P^{a}_{ss'}\,v^*(s') \right) \tag{1.1}$$
$$q^*(s,a) = R^a_{s} + \gamma\sum_{s' \in S} P^{a}_{ss'}\max_{a'}q^*(s',a') \tag{1.2}$$
The Bellman optimality equations of a Markov decision process satisfy both conditions, so dynamic programming can be used to solve them. The core task is to find the optimal policy $\pi$ that maximizes the value function (1.1). This requires solving two problems:
- how to compute the value function
- how to choose the policy
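To make the optimality backup (1.1) concrete, here is a minimal sketch on a tiny two-state, two-action MDP. The numbers in `R` and `P` are invented for illustration and are not from [1]; the backup is applied repeatedly until the values stop changing.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made-up numbers for illustration)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                     # immediate reward R_s^a
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])       # transition probs P_{ss'}^a
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup (1.1): v*(s) = max_a (R_s^a + γ Σ_s' P v*(s'))
    v_new = np.max(R + gamma * (P @ v), axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

q = R + gamma * (P @ v)   # optimal action values, equation (1.2)
print(v)
print(q)
```

At the fixed point, $v^*(s) = \max_{a} q^*(s,a)$ holds by construction, which is a quick sanity check on the implementation.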
Computing the Value Function
From the earlier study of MDPs we have the following two formulas:
$$v_\pi (s) = \sum_{a \in A} \pi(a|s)\,q_{\pi}(s,a) \tag{1.3}$$
$$q_\pi (s,a) = R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\,v_{\pi}(s') \tag{1.4}$$
Substituting (1.4) into (1.3) gives:
$$v_\pi (s) = \sum_{a \in A} \pi(a|s)\left( R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\,v_{\pi}(s') \right) \tag{1.5}$$
This equation shows that $v_\pi(s)$ must be derived from the values $v_\pi(s')$ of its successor states.
For model-based dynamic programming, $R$, $P$, and $\gamma$ are all known, as is the current state $s$, so the only unknowns in (1.5) are the values $v_{\pi}(s')$. Equation (1.5) can therefore be viewed as a system of linear equations and solved iteratively (the update below is a synchronous sweep computed from the previous iterate; an in-place Gauss-Seidel-style update also works and typically converges faster):
$$v_{k+1} (s) = \sum_{a \in A} \pi(a|s)\left( R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\,v_{k}(s') \right) \tag{1.6}$$
where $k$ is the iteration index.
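Since (1.5) is linear in the unknowns, it can also be solved in closed form rather than iteratively. The sketch below compares the two on a hypothetical three-state chain under a fixed policy (`r_pi` and `P_pi` are invented for illustration): the closed form solves $(I - \gamma P_\pi)v = r_\pi$ with `np.linalg.solve`, while the loop applies update (1.6).

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy (made-up numbers)
r_pi = np.array([-1.0, -1.0, 0.0])        # expected immediate reward per state
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])        # state-transition matrix under pi
gamma = 0.9

# Closed form: (I - gamma * P_pi) v = r_pi
v_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative policy evaluation, equation (1.6)
v = np.zeros(3)
for _ in range(500):
    v = r_pi + gamma * (P_pi @ v)

print(v_direct)  # prints [-1.9 -1.  0. ]
print(v)         # converges to the same fixed point
```

The direct solve is exact but costs a matrix inversion; the iterative form scales better to large state spaces, which is why dynamic programming uses it.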
The computation proceeds as shown in the iterative policy evaluation procedure in [1].
Take the grid-world example from [1]: a 4×4 grid in which states 0 and 15 (the two corner cells) are terminal, every move yields reward −1, and the agent can move up, down, left, or right (moves off the grid leave the position unchanged).
Based on my understanding, the following code computes the value function of this grid world from [1]:
```python
import numpy as np

def index2pos(vec, width):
    # Convert a linear state index into a (row, col) grid position
    return np.array([np.floor(vec/width), vec % width], dtype=np.int32)

def get_avil_pos(pos, shape):
    # Successor positions for up/down/left/right, clipped at the border
    ret = [
        np.array([np.clip(pos[0]-1, 0, shape[0] - 1), pos[1]]),   # up
        np.array([np.clip(pos[0]+1, 0, shape[0] - 1), pos[1]]),   # down
        np.array([pos[0], np.clip(pos[1]-1, 0, shape[1] - 1)]),   # left
        np.array([pos[0], np.clip(pos[1]+1, 0, shape[1] - 1)])    # right
    ]
    return ret

def main():
    States = np.zeros((4, 4))
    States_new = np.zeros((4, 4))
    r = -1
    gamma = 1
    times = 10
    for k in range(1, times):
        # One synchronous sweep of update (1.6) under the uniform policy
        for s in range(1, 14+1):          # states 0 and 15 are terminal
            pos = index2pos(s, 4)
            avil_pos = get_avil_pos(pos, np.array([4, 4]))
            States_new[pos[0], pos[1]] = 0
            for p in avil_pos:
                States_new[pos[0], pos[1]] += 1/len(avil_pos) * (
                    r + gamma*States[p[0], p[1]]
                )
        States = States_new.copy()
        print("\n----------------------------------------------------")
        print("\nK = {0}".format(k), end="\n\n")
        print(States)

if __name__ == "__main__":
    main()
```
The output is:

```
----------------------------------------------------
K = 1
[[ 0. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1.  0.]]
----------------------------------------------------
K = 2
[[ 0.   -1.75 -2.   -2.  ]
 [-1.75 -2.   -2.   -2.  ]
 [-2.   -2.   -2.   -1.75]
 [-2.   -2.   -1.75  0.  ]]
----------------------------------------------------
K = 3
[[ 0.     -2.4375 -2.9375 -3.    ]
 [-2.4375 -2.875  -3.     -2.9375]
 [-2.9375 -3.     -2.875  -2.4375]
 [-3.     -2.9375 -2.4375  0.    ]]
----------------------------------------------------
K = 4
[[ 0.      -3.0625  -3.84375 -3.96875]
 [-3.0625  -3.71875 -3.90625 -3.84375]
 [-3.84375 -3.90625 -3.71875 -3.0625 ]
 [-3.96875 -3.84375 -3.0625   0.     ]]
----------------------------------------------------
K = 5
[[ 0.        -3.65625   -4.6953125 -4.90625  ]
 [-3.65625   -4.484375  -4.78125   -4.6953125]
 [-4.6953125 -4.78125   -4.484375  -3.65625  ]
 [-4.90625   -4.6953125 -3.65625    0.       ]]
----------------------------------------------------
K = 6
[[ 0.         -4.20898438 -5.50976562 -5.80078125]
 [-4.20898438 -5.21875    -5.58984375 -5.50976562]
 [-5.50976562 -5.58984375 -5.21875    -4.20898438]
 [-5.80078125 -5.50976562 -4.20898438  0.        ]]
----------------------------------------------------
K = 7
[[ 0.         -4.734375   -6.27734375 -6.65527344]
 [-4.734375   -5.89941406 -6.36425781 -6.27734375]
 [-6.27734375 -6.36425781 -5.89941406 -4.734375  ]
 [-6.65527344 -6.27734375 -4.734375    0.        ]]
----------------------------------------------------
K = 8
[[ 0.         -5.2277832  -7.0078125  -7.46630859]
 [-5.2277832  -6.54931641 -7.08837891 -7.0078125 ]
 [-7.0078125  -7.08837891 -6.54931641 -5.2277832 ]
 [-7.46630859 -7.0078125  -5.2277832   0.        ]]
----------------------------------------------------
K = 9
[[ 0.         -5.69622803 -7.6975708  -8.23706055]
 [-5.69622803 -7.15808105 -7.77856445 -7.6975708 ]
 [-7.6975708  -7.77856445 -7.15808105 -5.69622803]
 [-8.23706055 -7.6975708  -5.69622803  0.        ]]
```
Choosing the Policy
Once the value function of the current policy has been computed, the policy is improved greedily at every state: in each state, choose the action that maximizes the state-action value function.
That is:
$$\pi'(s) = \arg\max_{a} q_{\pi}(s,a)$$
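A minimal sketch of this greedy improvement step, using hypothetical state-action values (not tied to the grid example):

```python
import numpy as np

# Hypothetical q_pi(s, a) for 3 states and 2 actions (made-up numbers)
q = np.array([[ 1.0, 2.0],
              [ 0.5, 0.1],
              [-1.0, 3.0]])

# Greedy improvement: pi'(s) = argmax_a q_pi(s, a)
greedy = np.argmax(q, axis=1)
print(greedy)  # prints [1 0 1]
```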
Policy Iteration Algorithm
Combining the value-function computation above with greedy policy improvement gives the policy iteration algorithm: evaluate the current policy, improve it, and repeat until the greedy policy stops changing.
The code written according to this algorithm is as follows:
```python
import numpy as np

def index2pos(vec, width):
    # Convert a linear state index into a (row, col) grid position
    return np.array([np.floor(vec/width), vec % width], dtype=np.int32)

def get_avil_pos(pos, shape):
    # Successor position for each action, clipped at the grid border
    ret = {
        "up":    np.array([np.clip(pos[0]-1, 0, shape[0] - 1), pos[1]]),
        "down":  np.array([np.clip(pos[0]+1, 0, shape[0] - 1), pos[1]]),
        "left":  np.array([pos[0], np.clip(pos[1]-1, 0, shape[1] - 1)]),
        "right": np.array([pos[0], np.clip(pos[1]+1, 0, shape[1] - 1)])
    }
    return ret

def find_best_policy(policy):
    # For each state, pick the action with the largest probability
    best = []
    for i in policy:
        v, k = max(zip(i.values(), i.keys()))
        best.append(k)
    return best

def main():
    r = -1
    gamma = 1
    policy = []
    best_policy = []
    for i in range(0, 16):
        # Start from the uniform random policy
        policy.append({"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25})
    K_times = 4
    Times = 1
    while True:
        States = np.zeros((4, 4))
        States_new = np.zeros((4, 4))
        print("\n----------------------------------------------------")
        print("{0} iteration: \n".format(Times))
        print("Initial Policy:")
        print(np.array(policy))
        # Policy evaluation: a few synchronous sweeps of update (1.6)
        for k in range(1, K_times):
            for s in range(1, 14+1):      # states 0 and 15 are terminal
                pos = index2pos(s, 4)
                avil_pos = get_avil_pos(pos, np.array([4, 4]))
                States_new[pos[0], pos[1]] = 0
                for p in avil_pos:
                    States_new[pos[0], pos[1]] += policy[s][p] * (
                        r + gamma*States[avil_pos[p][0], avil_pos[p][1]]
                    )
            States = States_new.copy()
        print("\n----------------------------------------------------")
        print("\nAfter {0} iterations, the value function of all states:".format(k), end="\n\n")
        print(States)
        # Policy improvement: weight each action by its successor's value,
        # shifted so the worst action gets weight 0 (assumes the successor
        # values of a state are not all equal, so total > 0)
        for s in range(1, 14+1):
            pos = index2pos(s, 4)
            avil_pos = get_avil_pos(pos, np.array([4, 4]))
            total = 0
            minnum = 0
            for p in avil_pos:
                minnum = min(minnum, States[avil_pos[p][0], avil_pos[p][1]])
            for p in avil_pos:
                total += States[avil_pos[p][0], avil_pos[p][1]] - minnum
            for p in avil_pos:
                policy[s][p] = (States[avil_pos[p][0], avil_pos[p][1]] - minnum) / total
        print("\n----------------------------------------------------")
        print("Updated Policy:")
        print(np.array(policy))
        temp_policy = find_best_policy(policy)
        print("\nChoose:")
        print(np.array(temp_policy))
        # Stop when the greedy policy no longer changes
        if best_policy == temp_policy:
            break
        else:
            best_policy = temp_policy
            Times += 1
    print("After {0} times of iteration, we find the best policy.".format(Times))

if __name__ == "__main__":
    main()
```
Output (the best action for each of states 0-15, in order):

```
Choose:
['up' 'left' 'left' 'left' 'up' 'up' 'left' 'down' 'up' 'up' 'right'
 'down' 'up' 'right' 'right' 'up']
After 2 times of iteration, we find the best policy.
```
One final note: the value-function computation above requires many evaluation sweeps per policy, but in practice the greedy policy obtained after a limited number of sweeps is often identical to the one obtained by iterating the evaluation to convergence. The policy can therefore be improved after just a single evaluation sweep; the resulting algorithm is called value iteration.
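As a sketch of that idea, here is value iteration on the same 4×4 grid (r = −1, γ = 1, states 0 and 15 terminal). This block is self-contained and uses its own neighbor helper rather than the functions above:

```python
import numpy as np

# Value iteration on the 4x4 grid from [1]: one maximizing backup per sweep
# instead of a full policy evaluation. States 0 and 15 are terminal.
r, gamma = -1, 1
V = np.zeros((4, 4))

def neighbors(i, j):
    # up, down, left, right, clipped at the grid border
    return [(max(i-1, 0), j), (min(i+1, 3), j),
            (i, max(j-1, 0)), (i, min(j+1, 3))]

for _ in range(100):
    V_new = V.copy()
    for s in range(1, 15):              # skip the two terminal states
        i, j = divmod(s, 4)
        # Bellman optimality backup: max over actions, cf. equation (1.1)
        V_new[i, j] = max(r + gamma * V[p, q] for p, q in neighbors(i, j))
    if np.allclose(V_new, V):
        break
    V = V_new

print(V)
```

The converged values equal the negated number of steps to the nearest terminal corner, which matches the greedy policy found by the policy iteration code above.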