Markov Decision Process
Requirements of a Markov decision process
- Detectable states: every state can be observed; for example, in chess the win/loss state must be detectable.
- Repeated trials: in a given state we do not know the optimal action in advance, so we need many attempts to discover it.
- The Markov property: the system's next state depends only on the current state (and, in a decision process, on the action currently taken), not on any earlier states.
A Markov decision process consists of five elements (a small illustrative sketch of this five-tuple follows the list):
- S: the set of states
- A: the set of actions
- P: the state transition probabilities. P_sa describes the distribution over next states after taking action a in state s; the probability of ending up in s' is written p(s'|s, a).
- R: the reward function, the immediate reward the agent receives after taking an action
- γ: the discount factor, meaning that the reward received now counts for more (or less) than rewards received in the future
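As a concrete illustration (my own minimal sketch, not part of the original notes), the five-tuple can be written down directly for a tiny two-state problem; every name and number below (states, actions, P, R, gamma) is made up for illustration only.

# A toy MDP spelled out as the five-tuple (S, A, P, R, gamma).
# Every name and number here is illustrative only.
states = ["s0", "s1"]        # S: the state set
actions = ["stay", "move"]   # A: the action set

# P[s][a] maps each possible next state s' to p(s'|s, a)
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# R[s][a]: immediate reward for taking action a in state s
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 1.0, "move": -1.0},
}

gamma = 0.9  # discount factor: weight of future reward relative to immediate reward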
Decision process
1. The agent starts in the initial state S0.
2. It chooses an action a0.
3. According to the transition probabilities P_{S0,a0} it moves to the next state S1.
The same loop then repeats from S1, and so on (a small sampling sketch follows).
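To make the S0 → a0 → S1 loop concrete, here is a hedged sketch that samples one short trajectory from a transition model shaped like the toy five-tuple above (it reuses that illustrative P and R; the helper sample_episode and the fixed "always move" policy are likewise my own illustrative choices).

import numpy as np

def sample_episode(P, R, start_state, policy, steps=5, seed=0):
    """Roll out a short trajectory: choose a_t, then draw s_{t+1} from P[s_t][a_t]."""
    rng = np.random.default_rng(seed)
    s = start_state
    trajectory = []
    for _ in range(steps):
        a = policy(s)                                    # choose an action in the current state
        next_states = list(P[s][a].keys())
        probs = list(P[s][a].values())
        s_next = str(rng.choice(next_states, p=probs))   # transition according to P_sa
        trajectory.append((s, a, R[s][a], s_next))       # (state, action, reward, next state)
        s = s_next
    return trajectory

# Example: always "move", starting from s0, using the toy P and R defined above
print(sample_episode(P, R, "s0", policy=lambda s: "move"))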
State value function
The state value function gives the expected future return obtainable from state s at time t, V^π(s) = E_π[ Σ_{k≥0} γ^k r_{t+k} | s_t = s ], and can be used to judge how good a state (or state-action pair) is.
Optimal value function: V*(s) = max_π V^π(s), the largest value achievable over all policies.
Bellman equation
The value of the current state is the immediate reward plus the discounted value of the next state; for the optimal value function this reads V*(s) = max_a Σ_{s'} p(s'|s, a) [ r + γ V*(s') ], which is exactly the update the value-iteration code below implements.
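As a worked one-line instance of this relationship (numbers chosen to match the Gridworld below: a deterministic move, a step cost of -1, γ = 1, and a next state currently valued at -1):

# One Bellman backup for a single (state, action) pair:
#   Q(s, a) = sum over s' of p(s'|s, a) * (r + gamma * V(s'))
gamma = 1.0
prob, reward, v_next = 1.0, -1.0, -1.0   # deterministic move, -1 step cost, V(s') = -1
q_sa = prob * (reward + gamma * v_next)
print(q_sa)  # -2.0: one step of cost plus the value of the state we land in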
Here is a complete example.
Code
First, the Gridworld environment code.
We build a 4*4 grid whose top-left and bottom-right cells are the exits, and every step receives a reward of -1 (0 once an exit has been reached).
One caveat: the code depends on the gym package, and the latest releases no longer seem to include the discrete module, so you may need to pin an older version of gym. Open Anaconda Prompt and run pip install gym==0.2.0 (the version I used).
import numpy as np
import sys
from io import StringIO
from gym.envs.toy_text import discrete
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T o o o
    o x o o
    o o o o
    o o o T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4, 4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            P[s] = {a: [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        if close:
            return

        outfile = StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()
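Before moving on, here is a quick sanity check I find useful (my addition, not part of the original notes): save the code above as gridworld.py, which the solver below also assumes, then inspect the transition model and take one step of interaction under the old gym API.

from gridworld import GridworldEnv

env = GridworldEnv()
print(env.nS, env.nA)   # 16 states and 4 actions for the default 4x4 grid

# env.P[s][a] is a list of (prob, next_state, reward, done) tuples;
# for state 1 (top row, second column), LEFT (action 3) should lead to terminal state 0.
print(env.P[1])

# One interaction step under the old gym API: step returns (obs, reward, done, info).
obs = env.reset()
obs, reward, done, info = env.step(3)   # move LEFT from the (random) start state
print(obs, reward, done)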
Next comes the code that solves the MDP with value iteration.
import numpy as np
from gridworld import GridworldEnv
env = GridworldEnv()
def value_iteration(env, theta=0.0001, discount_factor=1.0):
    def one_step_lookahead(state, v):
        # state: the current state; v: the current value estimates for all states
        action_values = np.zeros(env.nA)  # env.nA is the number of actions (4 directions here)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                # Bellman backup: expected immediate reward plus discounted value of the next state
                action_values[a] += prob * (reward + discount_factor * v[next_state])
        return action_values

    v = np.zeros(env.nS)
    # Value iteration loop
    while True:
        # delta tracks the largest value change in this sweep (stopping criterion)
        delta = 0
        # Update every state
        for s in range(env.nS):  # env.nS is the number of cells, 16 here
            # One-step lookahead to get the value of each action
            action_values = one_step_lookahead(s, v)
            # The best achievable value from this state
            best_action_value = np.max(action_values)
            # Track the largest change across all states
            delta = max(delta, np.abs(best_action_value - v[s]))
            # Update the value of this state
            v[s] = best_action_value
        if delta < theta:
            break

    # Extract a deterministic greedy policy from the converged values
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        action_values = one_step_lookahead(s, v)
        best_action = np.argmax(action_values)
        policy[s, best_action] = 1.0

    return policy, v

policy, v = value_iteration(env)
print("Policy distribution")
print(policy)
print("Reshaped (UP=0, RIGHT=1, DOWN=2, LEFT=3)")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
Output
Policy distribution
[[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]]
Reshaped (UP=0, RIGHT=1, DOWN=2, LEFT=3)
[[0 3 3 2]
[0 0 0 2]
[0 0 1 2]
[0 1 1 0]]
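As one extra check of the result (my addition), you can also print the converged state values; with the -1 step reward and discount_factor=1.0, each value should simply be minus the number of steps from that cell to the nearest exit.

# Print the converged value function on the grid; with gamma = 1 and a -1 step
# cost, v[s] equals the negative distance (in steps) to the nearest terminal corner.
print(np.reshape(v, env.shape))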
Summary
These are my study notes on the Markov decision process and the Bellman equation in reinforcement learning; comments and discussion are welcome.