The previous post introduced MDPs and their two solution methods, value iteration and policy iteration. In this post we will walk step by step through a Python implementation of the value iteration algorithm on a 4×3 grid. First, recall the value iteration algorithm, shown in the figure below:
The most important input to construct is the transition model p(s'|s, a). We can represent it as a matrix: since there are 12 cells and 4 actions, p(s'|s, a) becomes a 4×12×12 array. To make things easier to describe, we first number the 12 cells as shown in the figure below:
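The numbering figure is not reproduced here, but the convention can be inferred from the transition matrices that follow: cells are numbered left to right, bottom row first (1–4 on the bottom row, 5–8 in the middle, 9–12 on top), so moving up from cell 1 lands in cell 5. A small helper (hypothetical, not part of the original post) makes the mapping explicit:

```python
def cell_to_rowcol(s, n_cols=4):
    """Map a 1-based cell number to (row, col); row 1 is the bottom row."""
    return ((s - 1) // n_cols + 1, (s - 1) % n_cols + 1)

print(cell_to_rowcol(1))   # (1, 1): bottom-left
print(cell_to_rowcol(5))   # (2, 1): directly above cell 1
print(cell_to_rowcol(12))  # (3, 4): top-right, the +1 terminal
```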
Now we can construct the corresponding transition matrices. One tip here: because the game stops once the Agent reaches a terminal cell, the transition probabilities out of cells 8 and 12 in the figure above are all 0. A pitfall is assuming that each cell's transition probabilities for a given action must sum to 1; don't ask me how I know, I fell into that one myself >.<... The code is as follows:
import numpy as np

# Define the state-transition matrices: one 12x12 matrix per action.
# Rows for the terminal cells 8 and 12 are all zeros.
up_probability = [[0.1,0.1,0,0,0.8,0,0,0,0,0,0,0],
                  [0.1,0.8,0.1,0,0,0,0,0,0,0,0,0],
                  [0,0.1,0,0.1,0,0,0.8,0,0,0,0,0],
                  [0,0,0.1,0.1,0,0,0,0.8,0,0,0,0],
                  [0,0,0,0,0.2,0,0,0,0.8,0,0,0],
                  [0,0,0,0,0.1,0,0.1,0,0,0.8,0,0],
                  [0,0,0,0,0,0,0.1,0.1,0,0,0.8,0],
                  [0,0,0,0,0,0,0,0,0,0,0,0],
                  [0,0,0,0,0,0,0,0,0.9,0.1,0,0],
                  [0,0,0,0,0,0,0,0,0.1,0.8,0.1,0],
                  [0,0,0,0,0,0,0,0,0,0.1,0.8,0.1],
                  [0,0,0,0,0,0,0,0,0,0,0,0]]
down_probability = [[0.9,0.1,0,0,0,0,0,0,0,0,0,0],
                    [0.1,0.8,0.1,0,0,0,0,0,0,0,0,0],
                    [0,0.1,0.8,0.1,0,0,0,0,0,0,0,0],
                    [0,0,0.1,0.9,0,0,0,0,0,0,0,0],
                    [0.8,0,0,0,0.2,0,0,0,0,0,0,0],
                    [0,0.8,0,0,0.1,0,0.1,0,0,0,0,0],
                    [0,0,0.8,0,0,0,0.1,0.1,0,0,0,0],
                    [0,0,0,0,0,0,0,0,0,0,0,0],
                    [0,0,0,0,0.8,0,0,0,0.1,0.1,0,0],
                    [0,0,0,0,0,0,0,0.8,0.1,0.1,0,0],
                    [0,0,0,0,0,0,0.8,0,0.1,0,0.1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,0]]
left_probability = [[0.9,0,0,0,0.1,0,0,0,0,0,0,0],
                    [0.8,0.2,0,0,0,0,0,0,0,0,0,0],
                    [0,0.8,0.1,0,0,0,0.1,0,0,0,0,0],
                    [0,0,0.8,0.1,0,0,0,0.1,0,0,0,0],
                    [0.1,0,0,0,0.8,0,0,0,0.1,0,0,0],
                    [0,0.1,0,0,0.8,0,0,0,0,0.1,0,0],
                    [0,0,0.1,0,0,0,0.8,0,0,0,0.1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,0],
                    [0,0,0,0,0.1,0,0,0,0.9,0,0,0],
                    [0,0,0,0,0,0,0,0,0.8,0.2,0,0],
                    [0,0,0,0,0,0,0.1,0,0,0.8,0.1,0],
                    [0,0,0,0,0,0,0,0,0,0,0,0]]
right_probability = [[0.1,0.8,0,0,0.1,0,0,0,0,0,0,0],
                     [0,0.2,0.8,0,0,0,0,0,0,0,0,0],
                     [0,0,0.1,0.8,0,0,0.1,0,0,0,0,0],
                     [0,0,0,0.9,0,0,0,0.1,0,0,0,0],
                     [0.1,0,0,0,0.8,0,0,0,0.1,0,0,0],
                     [0,0.1,0,0,0,0,0.8,0,0,0.1,0,0],
                     [0,0,0.1,0,0,0,0,0.8,0,0,0.1,0],
                     [0,0,0,0,0,0,0,0,0,0,0,0],
                     [0,0,0,0,0.1,0,0,0,0.1,0.8,0,0],
                     [0,0,0,0,0,0,0,0,0,0.2,0.8,0],
                     [0,0,0,0,0,0,0.1,0,0,0,0.1,0.8],
                     [0,0,0,0,0,0,0,0,0,0,0,0]]
probability = [up_probability, down_probability, left_probability, right_probability]
P = np.array(probability)
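Before moving on, it is worth sanity-checking the pitfall mentioned above: every non-terminal row of p(s'|s, a) should sum to 1, while the rows for the terminal cells are all zeros. A minimal, self-contained check of this invariant, shown here on a toy tensor (the real 4×12×12 P built above can be passed in the same way):

```python
import numpy as np

def check_transition_tensor(P, terminal_indices):
    """Each non-terminal row of p(s'|s,a) must sum to 1;
    terminal rows are all zeros by construction."""
    n_actions, n_states, _ = P.shape
    row_sums = P.sum(axis=2)  # shape: n_actions x n_states
    for s in range(n_states):
        expected = 0.0 if s in terminal_indices else 1.0
        assert np.allclose(row_sums[:, s], expected), \
            f"state {s+1}: rows sum to {row_sums[:, s]}"
    return True

# Toy 2-state, 1-action example: state 0 is ordinary, state 1 is terminal.
P_demo = np.array([[[0.3, 0.7],
                    [0.0, 0.0]]])
print(check_transition_tensor(P_demo, terminal_indices={1}))  # True
```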
Initialize the inputs:
States = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Actions = ['up', 'down', 'left', 'right']
# Cell 6 is the obstacle (a large negative reward keeps the agent out);
# cells 8 and 12 are the -1 and +1 terminal cells.
Rewards = [-0.04, -0.04, -0.04, -0.04,
           -0.04, -10000, -0.04, -1,
           -0.04, -0.04, -0.04, 1]
Initialize the utility values and delta:
# r: the discount factor
r = 0.9
U = np.zeros(12)
U_updated = np.zeros(12)
# epsilon: stopping threshold; we stop updating once the change in the
# utility vector between two iterations falls below it
epsilon = 0.01
# delta: the difference between the utility vector before and after one
# iteration, measured as the largest change in any single state's
# utility (a scalar)
# We initialize delta to 0 here and update it during the iterations
delta = 0
Next we define a function to update the utilities; the code below updates a single state's utility according to the Bellman update U(s) ← R(s) + r · max_a Σ_s' p(s'|s, a) U(s'):
# actionid ranges over [0, 1, 2, 3]
# P[:, index, :] is a 4x12 matrix, U is a 12-vector
def maxUtility(s):
    index = s - 1
    # expected utility of each action: sum over s' of p(s'|s,a) * U(s')
    action_rewards = (P[:, index, :] * U).sum(axis=1)
    next_disReward = r * max(action_rewards)
    us_updated = Rewards[index] + next_disReward
    return us_updated
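A quick note on the vectorized expression: `(P[:, index, :] * U).sum(axis=1)` broadcasts U across the 4×12 slice and sums over s', which is exactly a matrix-vector product per action. A small sketch with stand-in random arrays (the shapes, not the values, are what matter here):

```python
import numpy as np

rng = np.random.default_rng(0)
P_demo = rng.random((4, 12, 12))  # stand-in transition tensor (not normalized)
U_demo = rng.random(12)           # stand-in utility vector
index = 4                         # any state index

# Element-wise multiply with broadcasting, then sum over s' ...
a = (P_demo[:, index, :] * U_demo).sum(axis=1)
# ... equals a matrix-vector product for each action.
b = P_demo[:, index, :] @ U_demo

print(np.allclose(a, b))  # True
```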
Using this per-state maximization function, we update the utilities until convergence:
while True:
    U = U_updated.copy()
    delta = 0
    print(U)
    for s in States:
        us_updated = maxUtility(s)
        U_updated[s-1] = us_updated
        s_dif = np.absolute(U_updated[s-1] - U[s-1])
        delta = s_dif if s_dif > delta else delta
    if delta < epsilon * (1-r) / r:
        break
Once the utilities have converged, we define a function that returns the optimal action for each state:
# Pick the next action according to the maximum-expected-utility rule
def getNextStateAction(s):
    index = s - 1
    action_rewards = (P[:, index, :] * U).sum(axis=1)
    actionID = np.argmax(action_rewards)
    print(Actions[actionID])
Calling this function gives the optimal action for each state (i.e. our policy):
# Optimal actions for every cell except the obstacle (cell 6)
# and the terminal cells (8 and 12)
for s in [1, 2, 3, 4, 5, 7, 9, 10, 11]:
    getNextStateAction(s)
# output: [up, right, up, left, up, up, right, right, right]
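Instead of printing each action, the whole policy can also be extracted in one shot with argmax over the per-action expected utilities. A self-contained sketch on a toy two-state MDP (the helper name `extract_policy` and the toy numbers are made up for illustration):

```python
import numpy as np

def extract_policy(P, U, actions):
    """Greedy policy: for each state, pick the action maximizing
    the expected utility of the successor state."""
    # expected[a, s] = sum over s' of P[a, s, s'] * U[s']
    expected = np.einsum('asj,j->as', P, U)
    return [actions[a] for a in expected.argmax(axis=0)]

# Toy 2-state, 2-action MDP (hypothetical numbers):
P_demo = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 'stay'
    [[0.0, 1.0], [1.0, 0.0]],   # action 'move'
])
U_demo = np.array([0.0, 1.0])
print(extract_policy(P_demo, U_demo, ['stay', 'move']))  # ['move', 'stay']
```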
After the updates, we visualize the resulting utilities and the corresponding optimal actions below:
Postscript:
Here is something to think about. In the previous post we said that no matter which state we start from, the optimal policy we end up with is the same. We can actually verify this with the program: if we randomly shuffle the order of the cells in States = [1,2,3,4,5,6,7,8,9,10,11,12], the final utilities will still come out the same. Don't believe it? Try it yourself!!
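The shuffle experiment is easy to reproduce on a small scale. The sketch below (a hypothetical compact solver, not the code from this post) runs the same synchronous value-iteration sweep in two different state orders and checks that the utilities agree; because each sweep reads only the previous vector U, the sweep order cannot change the result:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6, order=None):
    """Tiny synchronous value-iteration solver; `order` is the sweep
    order over states, which should not affect the fixed point."""
    n_states = P.shape[1]
    order = range(n_states) if order is None else order
    U = np.zeros(n_states)
    while True:
        U_new = U.copy()
        for s in order:
            # Bellman update reads only the previous vector U
            U_new[s] = R[s] + gamma * (P[:, s, :] @ U).max()
        if np.abs(U_new - U).max() < eps:
            return U_new
        U = U_new

# Toy chain: one action, state 1 is terminal (all-zero row).
P_demo = np.array([[[0.9, 0.1],
                    [0.0, 0.0]]])
R_demo = np.array([-0.04, 1.0])

u_fwd = value_iteration(P_demo, R_demo, order=[0, 1])
u_rev = value_iteration(P_demo, R_demo, order=[1, 0])
print(np.allclose(u_fwd, u_rev))  # True
```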
MDP series posts: