Reinforcement Learning in Practice 1: Evaluating a Random Policy on a 4*4 Grid World by Iteration

R戎

Article link: https://blog.csdn.net/R18830287035/article/details/90210551

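This post uses code to demonstrate the grid world example from Lecture 3 of the "Reinforcement Learning" course: using dynamic programming to evaluate, by iterative computation, a random policy in a 4*4 grid world. The problem is as follows: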

[Figure: the 4*4 grid world; the two terminal states are shown as gray cells]

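Given (see the figure above):

State space S: S_{1} to S_{14} are non-terminal states; S_{0} and S_{15} are terminal states, shown as the two gray cells in the figure;
Action space A: {n, e, s, w}; in every non-terminal state the agent can move north, east, south, or west;
Transition probability P: any action that would take the agent off the grid leaves its position unchanged; otherwise the agent moves to the cell the action points to with probability 100%;
Immediate reward R: every transition out of a non-terminal state yields an immediate reward of -1, including the move that enters a terminal state; once in a terminal state the reward is 0;
Discount factor γ: 1;

Current policy π: the agent acts uniformly at random; in every non-terminal state it moves in each of the four directions with equal probability, i.e. π(n|•) = π(e|•) = π(s|•) = π(w|•) = 1/4.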

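Problem: evaluate the given policy in this grid world.

This is equivalent to computing the (state-)value function of the grid world under the given policy, that is, the value of every state in the grid world under that policy.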
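For reference, this kind of iterative evaluation corresponds to the standard iterative policy evaluation update (the Bellman expectation backup). The form below is specialized to this deterministic grid world; the symbol $s'(s,a)$ for the cell reached from $s$ by action $a$ is notation introduced here for illustration, not taken from the original post.

```latex
\begin{aligned}
v_{k+1}(s) &= \sum_{a \in A} \pi(a \mid s)\,\bigl(R_s + \gamma\, v_k(s'(s,a))\bigr) \\
           &= \frac{1}{4} \sum_{a \in A} \bigl(-1 + v_k(s'(s,a))\bigr),
              \qquad s \in \{S_1, \dots, S_{14}\}, \\
v_{k+1}(S_0) &= v_{k+1}(S_{15}) = 0 .
\end{aligned}
```

Starting from $v_0(s) = 0$ for all states, repeating this update for every state is one sweep of the evaluation.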

We write Python code to solve this problem.

Declare the states:

states = [i for i in range(16)]
Declare the state values, initializing the value of every state to 0:

values = [0 for _ in range(16)]
Declare the action space:

actions = ["n", "e", "s", "w"]
Taking advantage of the grid layout, declare in a simple way how each action changes the state index:

ds_actions = {"n": -4, "e": 1, "s": 4, "w": -1}
Set the discount factor to 1:

gamma = 1.00
Determine the next state from the current state and the action:

def nextState(s, a):
  next_state = s
  # a move that would leave the grid keeps the agent where it is
  if (s%4 == 0 and a == "w") or (s<4 and a == "n") or \
     ((s+1)%4 == 0 and a == "e") or (s > 11 and a == "s"):
    pass
  else:
    ds = ds_actions[a]
    next_state = s + ds
  return next_state
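The boundary conditions in nextState are easy to get wrong, so a quick cross-check can help. The sketch below compares the index arithmetic against a coordinate-based version; `next_state_rc` and `moves` are hypothetical names introduced here for illustration and are not part of the original code.

```python
# Index-based transition, following the rule used in the post.
ds_actions = {"n": -4, "e": 1, "s": 4, "w": -1}

def nextState(s, a):
    if (s % 4 == 0 and a == "w") or (s < 4 and a == "n") or \
       ((s + 1) % 4 == 0 and a == "e") or (s > 11 and a == "s"):
        return s  # the move would leave the grid: stay put
    return s + ds_actions[a]

# Hypothetical coordinate-based version, used only as a cross-check.
moves = {"n": (-1, 0), "e": (0, 1), "s": (1, 0), "w": (0, -1)}

def next_state_rc(s, a):
    row, col = divmod(s, 4)            # state id -> (row, column)
    d_row, d_col = moves[a]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < 4 and 0 <= new_col < 4:
        return new_row * 4 + new_col   # (row, column) -> state id
    return s                           # off the grid: stay put

# Collect every (state, action) pair on which the two versions disagree.
mismatches = [(s, a) for s in range(16) for a in "nesw"
              if nextState(s, a) != next_state_rc(s, a)]
print(mismatches)  # an empty list means the two versions agree
```

The coordinate form is longer but makes the "stay in place at the wall" rule explicit, which is useful if the grid size changes.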

Get the immediate reward of a state:

def rewardOf(s):
  return 0 if s in [0,15] else -1

Check whether a state is a terminal state:

def isTerminateState(s):
  return s in [0,15]

Get all possible successor states of a state:

def getSuccessors(s):
  successors = []
  if isTerminateState(s):
    return successors
  for a in actions:
    next_state = nextState(s, a)
    # blocked moves are kept too: the successor is then s itself
    successors.append(next_state)
  return successors

Update the value of a state from the values of its successor states:

def updateValue(s):
  successors = getSuccessors(s)  # empty for terminal states, so their value stays 0
  newValue = 0
  num = 4       # four equally likely actions
  reward = rewardOf(s)
  for next_state in successors:
    newValue += 1.00/num * (reward + gamma * values[next_state])
  return newValue
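As a worked check of this update (arithmetic added here, derived from the rules stated above): with all values initialized to 0, the first sweep gives every non-terminal state the value -1. In the second sweep, state 1 has successors 1 (north is blocked by the wall), 2, 5, and the terminal state 0, while state 5 has successors 1, 6, 9, and 4. Here $v_k$ denotes the values after sweep $k$, a notation assumed for this example.

```latex
\begin{aligned}
v_2(S_1) &= \tfrac{1}{4}\bigl[(-1 + v_1(S_1)) + (-1 + v_1(S_2)) + (-1 + v_1(S_5)) + (-1 + v_1(S_0))\bigr] \\
         &= \tfrac{1}{4}\bigl[(-2) + (-2) + (-2) + (-1)\bigr] = -1.75, \\
v_2(S_5) &= \tfrac{1}{4}\bigl[4 \times (-1 + (-1))\bigr] = -2 .
\end{aligned}
```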

Perform one iteration:

def performOneIteration():
  newValues = [0 for _ in range(16)]
  for s in states:
    newValues[s] = updateValue(s)
  global values
  values = newValues
  printValue(values)

Helper function that prints the state values:

def printValue(v):

  for i in range(16):
    print('{0:>6.2f}'.format(v[i]),end = " ")
    if (i+1)%4 == 0:
      print("")
  print()

Main function:

def main():
  max_iterate_times = 160
  cur_iterate_times = 0
  while cur_iterate_times <= max_iterate_times:
    print("Iterate No.{0}".format(cur_iterate_times))
    performOneIteration()
    cur_iterate_times += 1
  printValue(values)
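The main function above runs a fixed number of sweeps. A common alternative is to stop once the largest change in any state value falls below a small threshold. The following is a minimal sketch of that idea under stated assumptions: the names `next_state`, `sweep`, `evaluate`, `theta`, and `max_sweeps` are introduced here for illustration, and the threshold value is an arbitrary choice.

```python
GAMMA = 1.0
TERMINALS = (0, 15)
DELTAS = {"n": -4, "e": 1, "s": 4, "w": -1}

def next_state(s, a):
    """Same transition rule as nextState in the post."""
    blocked = ((a == "w" and s % 4 == 0) or (a == "n" and s < 4) or
               (a == "e" and s % 4 == 3) or (a == "s" and s > 11))
    return s if blocked else s + DELTAS[a]

def sweep(values):
    """One synchronous sweep: build a new list from the old values."""
    new_values = [0.0] * 16
    for s in range(16):
        if s in TERMINALS:
            continue  # terminal states keep value 0
        new_values[s] = sum(0.25 * (-1 + GAMMA * values[next_state(s, a)])
                            for a in DELTAS)
    return new_values

def evaluate(theta=1e-6, max_sweeps=1000):
    """Sweep until the largest change in any state value is below theta."""
    values = [0.0] * 16
    for k in range(1, max_sweeps + 1):
        new_values = sweep(values)
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < theta:
            return values, k
    return values, max_sweeps

values, sweeps = evaluate()
print("stopped after", sweeps, "sweeps")
print([round(v, 2) for v in values])
```

The sweep count depends on `theta`, so it will generally differ from the iteration number at which the two-decimal printout stops changing.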

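Since we know in advance that the algorithm converges after about 150 iterations, we set the maximum number of iterations to 160. The resulting value function is: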

The value function converges to:

  0.00 -14.00 -20.00 -22.00
-14.00 -18.00 -20.00 -20.00
-20.00 -20.00 -18.00 -14.00
-22.00 -20.00 -14.00   0.00

At Iterate No.153
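One way to sanity-check the converged table is to confirm that it is a fixed point of the update: applying one more backup to these values should return the same values. The sketch below does this; `table`, `next_state`, and `residuals` are names assumed for this example.

```python
# Converged values copied from the table above, in state order 0..15.
table = [0, -14, -20, -22,
         -14, -18, -20, -20,
         -20, -20, -18, -14,
         -22, -20, -14, 0]

DELTAS = {"n": -4, "e": 1, "s": 4, "w": -1}

def next_state(s, a):
    """Same transition rule as nextState in the post."""
    blocked = ((a == "w" and s % 4 == 0) or (a == "n" and s < 4) or
               (a == "e" and s % 4 == 3) or (a == "s" and s > 11))
    return s if blocked else s + DELTAS[a]

# For each non-terminal state, compare one backup against the table entry.
residuals = []
for s in range(1, 15):
    backup = sum(-1 + table[next_state(s, a)] for a in DELTAS) / 4
    residuals.append(backup - table[s])

print(max(abs(r) for r in residuals))  # 0 means the table is an exact fixed point
```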
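As the code above shows, we defined a method that returns the set of all possible successor states of a given state; this is where the dynamic programming idea shows up. If the possible successor states of a state cannot all be obtained, dynamic programming cannot be used to solve the problem. In addition, we update the values synchronously: the state values at one iteration are computed entirely from the state values of the previous iteration (performOneIteration writes into a separate newValues list rather than overwriting values in place).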
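For contrast with the separate newValues list used in performOneIteration, the following sketch shows an in-place variant, where each state's new value overwrites the old one immediately and is visible to the states updated later in the same sweep. This is not the method used in the post; `evaluate_in_place`, `theta`, and `max_sweeps` are hypothetical names, and the discount factor is fixed at 1 as in the problem statement.

```python
DELTAS = {"n": -4, "e": 1, "s": 4, "w": -1}

def next_state(s, a):
    """Same transition rule as nextState in the post."""
    blocked = ((a == "w" and s % 4 == 0) or (a == "n" and s < 4) or
               (a == "e" and s % 4 == 3) or (a == "s" and s > 11))
    return s if blocked else s + DELTAS[a]

def evaluate_in_place(theta=1e-6, max_sweeps=1000):
    """In-place policy evaluation for the uniform random policy (gamma = 1)."""
    values = [0.0] * 16
    for k in range(1, max_sweeps + 1):
        delta = 0.0
        for s in range(1, 15):  # states 0 and 15 are terminal and stay at 0
            new_value = sum(0.25 * (-1 + values[next_state(s, a)])
                            for a in DELTAS)
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value  # overwrite immediately, no second list
        if delta < theta:
            return values, k
    return values, max_sweeps

values_in_place, sweeps_in_place = evaluate_in_place()
print("stopped after", sweeps_in_place, "sweeps")
print([round(v, 2) for v in values_in_place])
```

Both variants should approach the same value function; the in-place form needs only one list and often needs fewer sweeps, though its intermediate values depend on the order in which states are visited.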

The complete Python code is as follows:

'''
Implementation of small grid world example illustrated by David Silver
in his Reinforcement Learning Lecture3 - Planning by Dynamic 
Programming. 
Author: Qiang Ye
Date: July 1, 2017

The value function converges to:
 0.00 -14.00 -20.00 -22.00 
-14.00 -18.00 -20.00 -20.00 
-20.00 -20.00 -18.00 -14.00 
-22.00 -20.00 -14.00   0.00 
At Iterate No.153
'''
# id of the states, 0 and 15 are terminal states
states = [i for i in range(16)]
#  0* 1  2   3  
#  4  5  6   7
#  8  9  10  11
#  12 13 14  15*

# initial values of states
values = [0  for _ in range(16)]

# Action
actions = ["n", "e", "s", "w"]

# change in state id caused by each action
# use a dictionary for convenient computation of next state id.
ds_actions = {"n": -4, "e": 1, "s": 4, "w": -1}  

# discount factor
gamma = 1.00

# compute the next state id from the current state and the action taken
def nextState(s, a):
  next_state = s
  if (s%4 == 0 and a == "w") or (s<4 and a == "n") or \
     ((s+1)%4 == 0 and a == "e") or (s > 11 and a == "s"):
    pass
  else:
    ds = ds_actions[a]
    next_state = s + ds
  return next_state

# reward of a state
def rewardOf(s):
  return 0 if s in [0,15] else -1

# check if a state is terminate state
def isTerminateState(s):
  return s in [0,15]

# get successor states of a given state s
def getSuccessors(s):
  successors = []
  if isTerminateState(s):
    return successors
  for a in actions:
    next_state = nextState(s, a)
    # if s != next_state:
    successors.append(next_state)
  return successors

# update the value of state s
def updateValue(s):
  successors = getSuccessors(s)  # empty for terminal states, so their value stays 0
  newValue = 0
  num = 4       # four equally likely actions
  reward = rewardOf(s)
  for next_state in successors:
    newValue += 1.00/num * (reward + gamma * values[next_state])
  return newValue

# perform one-step iteration
def performOneIteration():
  newValues = [0 for _ in range(16)]
  for s in states:
    newValues[s] = updateValue(s)
  global values
  values = newValues
  printValue(values)

# show some array info of the small grid world
def printValue(v):
  for i in range(16):
    print('{0:>6.2f}'.format(v[i]),end = " ")
    if (i+1)%4 == 0:
      print("")
  print()

# test function
def test():
  printValue(states)
  printValue(values)
  for s in states:
    reward = rewardOf(s)
    for a in actions:
      next_state = nextState(s, a)
      print("({0}, {1}) -> {2}, with reward {3}".format(s, a,next_state, reward))

  for i in range(200):
    performOneIteration()
    printValue(values)

def main():
  max_iterate_times = 160
  cur_iterate_times = 0
  while cur_iterate_times <= max_iterate_times:
    print("Iterate No.{0}".format(cur_iterate_times))
    performOneIteration()
    cur_iterate_times += 1
  printValue(values)

if __name__ == '__main__':
  main()

Result:
[Figure: output of running the program]
