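Reinforcement Learning in Practice (1): Solving the Frozen Lake Problem with Value Iteration and Policy Iteration
1. The Frozen Lake Problem
Problem description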
Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you'll fall into the freezing water.
At this time, there's an international frisbee shortage, so it's absolutely imperative that
you navigate across the lake and retrieve the disc.
However, the ice is slippery, so you won't always move in the direction you intend.
The surface is described using a grid like the following
SFFF
FHFH
FFFH
HFFG
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal, where the frisbee is located
The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.
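As a small illustration of the grid above, the sketch below maps each cell to a single state index and picks out the start, hole and goal states. It assumes row-by-row numbering (state = row * 4 + column), which matches the states 0-15 used later in this post; the names LAKE_MAP, to_state and classify_states are made up for this example.

```python
# The 4x4 map from the problem description, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def to_state(row, col, ncol=4):
    """Flatten (row, col) into a single state index, row by row (assumed numbering)."""
    return row * ncol + col


def classify_states(lake_map):
    """Return the start state, the list of hole states and the goal state."""
    ncol = len(lake_map[0])
    start, holes, goal = None, [], None
    for r, line in enumerate(lake_map):
        for c, cell in enumerate(line):
            s = to_state(r, c, ncol)
            if cell == "S":
                start = s
            elif cell == "H":
                holes.append(s)
            elif cell == "G":
                goal = s
    return start, holes, goal


start, holes, goal = classify_states(LAKE_MAP)
print(start, holes, goal)  # 0 [5, 7, 11, 12] 15
```

Under this numbering the episode ends in states 5, 7, 11, 12 (holes) and 15 (goal), and only the transition into state 15 yields a reward of 1.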
2. Value Iteration and Policy Iteration
The theory is omitted for now.
2.1 Value iteration
2.1.1 Pseudocode
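As a point of reference, the update performed by the value-iteration code in Section 3.2 can be written as follows. This is a sketch in standard textbook notation, not a reproduction of the original pseudocode: P(s' | s, a) and R(s, a, s') stand for the transition probability and reward stored in env.P, gamma is the discount factor, and theta is the convergence threshold.

```latex
% Initialise V_0(s) = 0 for every state s, then repeat for k = 0, 1, 2, ...
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V_k(s') \bigr]

% Stop as soon as the total change falls below the threshold
\sum_{s} \bigl| V_{k+1}(s) - V_k(s) \bigr| \le \theta

% Finally read off a greedy policy from the converged values
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V(s') \bigr]
```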
2.1.2 Flowchart
2.2 Policy iteration
2.2.1 Pseudocode
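For comparison, the two alternating steps carried out by the policy-iteration code in Section 3.3 can be sketched as follows. The notation is the standard textbook one and is an assumption of this sketch: P(s' | s, a) and R(s, a, s') are the entries of env.P, gamma is the discount factor, and pi is a deterministic policy.

```latex
% Policy evaluation: starting from V_0(s) = 0, iterate until the values stop changing
V^{\pi}_{k+1}(s) = \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\,\bigl[ R(s, \pi(s), s') + \gamma V^{\pi}_{k}(s') \bigr]

% Policy improvement: act greedily with respect to the evaluated values
\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^{\pi}(s') \bigr]

% Terminate when the policy no longer changes
\pi'(s) = \pi(s) \quad \text{for all } s
```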
2.2.2 Flowchart
3. Code implementation
3.1 Notes on the environment code
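- Actions (action): represented by the integers 0-3
LEFT = 0 DOWN = 1 RIGHT = 2 UP = 3
- env.P[state][action]
# Number of actions: 4
nA = 4
# Number of states: 4*4 = 16
nS = nrow * ncol
P = {s : {a : [] for a in range(nA)} for s in range(nS)}
P[][] is essentially a "two-dimensional array" (a dict of dicts) indexed by state (0-15) and action (0-3). P[state][action] stores the possible outcomes of taking action a in state s, as a list of tuples of the form (transition probability, next state, reward, done flag). The following code makes the structure of P[state][action] easier to see: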
P = {s : {a : [] for a in range(4)} for s in range(16)}
print(P)
P[1][1].append(('1','2','3','4'))
P[1][1].append(('a','b','c','d'))
for n in P[1][1]:
    a,b,c,d=n
    print(a,b,c,d)
print(P)
print(P[1][1])
Output:
{0: {0: [], 1: [], 2: [], 3: []}, 1: {0: [], 1: [], 2: [], 3: []}, 2: {0: [], 1: [], 2: [], 3: []}, 3: {0: [], 1: [], 2: [], 3: []}, 4: {0: [], 1: [], 2: [], 3: []}, 5: {0: [], 1: [], 2: [], 3: []}, 6: {0: [], 1: [], 2: [], 3: []}, 7: {0: [], 1: [], 2: [], 3: []}, 8: {0: [], 1: [], 2: [], 3: []}, 9: {0: [], 1: [], 2: [], 3: []}, 10: {0: [], 1: [], 2: [], 3: []}, 11: {0: [], 1: [], 2: [], 3: []}, 12: {0: [], 1: [], 2: [], 3: []}, 13: {0: [], 1: [], 2: [], 3: []}, 14: {0: [], 1: [], 2: [], 3: []}, 15: {0: [], 1: [], 2: [], 3: []}}
1 2 3 4
a b c d
{0: {0: [], 1: [], 2: [], 3: []}, 1: {0: [], 1: [('1', '2', '3', '4'), ('a', 'b', 'c', 'd')], 2: [], 3: []}, 2: {0: [], 1: [], 2: [], 3: []}, 3: {0: [], 1: [], 2: [], 3: []}, 4: {0: [], 1: [], 2: [], 3: []}, 5: {0: [], 1: [], 2: [], 3: []}, 6: {0: [], 1: [], 2: [], 3: []}, 7: {0: [], 1: [], 2: [], 3: []}, 8: {0: [], 1: [], 2: [], 3: []}, 9: {0: [], 1: [], 2: [], 3: []}, 10: {0: [], 1: [], 2: [], 3: []}, 11: {0: [], 1: [], 2: [], 3: []}, 12: {0: [], 1: [], 2: [], 3: []}, 13: {0: [], 1: [], 2: [], 3: []}, 14: {0: [], 1: [], 2: [], 3: []}, 15: {0: [], 1: [], 2: [], 3: []}}
[('1', '2', '3', '4'), ('a', 'b', 'c', 'd')]
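To connect the tuple format to the slippery ice, here is a rough sketch of how the entries of P[state][action] could be built for a single non-terminal state. It rests on one assumption that should be checked against the installed Gym version: the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The helper names move and slippery_transitions are invented for this example, and terminal states (holes and the goal) are not handled.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
NROW, NCOL = 4, 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def move(row, col, action):
    """Apply one action deterministically, staying in place at the border."""
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, NROW - 1)
    elif action == RIGHT:
        col = min(col + 1, NCOL - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row, col


def slippery_transitions(state, action):
    """Build the (prob, next_state, reward, done) list for one (state, action).

    Assumption: the intended direction and the two perpendicular directions
    are each taken with probability 1/3.
    """
    row, col = divmod(state, NCOL)
    transitions = []
    for actual in [(action - 1) % 4, action, (action + 1) % 4]:
        new_row, new_col = move(row, col, actual)
        cell = LAKE_MAP[new_row][new_col]
        next_state = new_row * NCOL + new_col
        reward = 1.0 if cell == "G" else 0.0
        done = cell in "GH"
        transitions.append((1.0 / 3.0, next_state, reward, done))
    return transitions


# State 14 is the cell directly left of the goal; try to move RIGHT.
for entry in slippery_transitions(14, RIGHT):
    print(entry)
```

For state 14 and action RIGHT, the three outcomes are sliding down (bumping into the border and staying in 14), reaching the goal in state 15 with reward 1, and sliding up into state 10.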
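3.2 Value iteration
import gym
import numpy as np

# Note: 'FrozenLake-v0' is the id used by older Gym releases. Newer Gym/Gymnasium
# releases register the environment as 'FrozenLake-v1', and there the transition
# table may have to be read from env.unwrapped.P instead of env.P.
env = gym.make('FrozenLake-v0')
# The 4x4 grid has 16 cells (states), numbered 0-15, so eon = 16
eon = env.observation_space.n
# There are 4 actions (left, down, right, up), numbered 0-3, so ean = 4
ean = env.action_space.n

# Value iteration
def value_iteration(env, gamma=1.0):
    # Initialize the state-value table (V table)
    value_table = np.zeros(eon)
    # Maximum number of iterations
    no_of_iterations = 100000
    # Convergence threshold
    threshold = 1e-20
    # Start iterating
    for i in range(no_of_iterations):
        # Copy the current V table; this copy holds the values from the previous sweep
        updated_value_table = np.copy(value_table)
        # For every state, compute the Q-value of each action from the previous
        # sweep's values, then set the state's V value to the largest Q-value
        for state in range(eon):
            # List of Q-values, one per action
            Q_value = []
            # Loop over the actions
            for action in range(ean):
                # Weighted returns of the possible next states
                next_states_rewards = []
                # env.P[state][action] is defined by the environment and holds the tuples
                # (transition probability, next state, reward, done flag)
                for next_sr in env.P[state][action]:
                    # done is True if next_state is a terminal state, otherwise False
                    trans_prob, next_state, reward, done = next_sr
                    # Contribution of this outcome: p * (r + gamma * V(s'))
                    next_states_rewards.append(
                        (trans_prob*(reward+gamma*updated_value_table[next_state])))
                # Q-value of the action: sum over all possible outcomes
                Q_value.append(np.sum(next_states_rewards))
            # Update the V value of the current state with the largest Q-value
            value_table[state] = max(Q_value)
        # Convergence check: total absolute change over all states
        if(np.sum(np.fabs(updated_value_table-value_table)) <= threshold):
            print("Value iteration converged at iteration #%d" % (i+1))
            break
    # Return the V table
    return value_table

# Policy extraction
def extract_policy(value_table, gamma=1.0):
    # Array that stores the policy (one action per state)
    policy = np.zeros(eon)
    # For every state, build its Q table by computing the Q-value of each action
    for state in range(eon):
        # Initialize the Q table of this state
        Q_table = np.zeros(ean)
        # Loop over the actions
        for action in range(ean):
            # Same computation as in value_iteration
            for next_sr in env.P[state][action]:
                trans_prob, next_state, reward, done = next_sr
                # Accumulate the Q-value of this action (the 4 actions are numbered 0-3)
                Q_table[action] += (trans_prob *
                                    (reward+gamma*value_table[next_state]))
        # In the current state, pick the action with the largest Q-value
        policy[state] = np.argmax(Q_table)
    # Return the policy
    return policy

# Optimal value function
optimal_value_function = value_iteration(env=env, gamma=1.0)
# Optimal policy
optimal_policy = extract_policy(optimal_value_function, gamma=1.0)
# Print the optimal policy
print(optimal_policy)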
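The same update can be checked by hand on a much smaller problem. The sketch below runs value iteration on a hypothetical three-state chain (not FrozenLake) whose transition table uses the same (probability, next state, reward, done) tuples as env.P; it needs neither Gym nor NumPy. The table TOY_P, the function toy_value_iteration and the choice gamma = 0.9 are assumptions made for this illustration.

```python
# A made-up deterministic chain: 0 -> 1 -> 2, where state 2 is terminal.
# Action 0 moves left (or stays at the left end), action 1 moves right.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def toy_value_iteration(P, n_states, n_actions, gamma=0.9, threshold=1e-10):
    """Synchronous value iteration on a transition table shaped like env.P."""
    values = [0.0] * n_states
    while True:
        old_values = list(values)  # values from the previous sweep
        for s in range(n_states):
            # Largest expected return over all actions
            values[s] = max(
                sum(p * (r + gamma * old_values[s2]) for p, s2, r, _ in P[s][a])
                for a in range(n_actions)
            )
        # Stop when the total change over all states is small enough
        if sum(abs(v - o) for v, o in zip(values, old_values)) <= threshold:
            return values


toy_values = toy_value_iteration(TOY_P, n_states=3, n_actions=2)
print(toy_values)  # [0.9, 1.0, 0.0]
```

State 1 is one step from the reward, so its value is 1; state 0 is two steps away, so its value is discounted once to 0.9; the terminal state keeps the value 0.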
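3.3 Policy iteration
# %%
import gym
import numpy as np

# Note: 'FrozenLake-v0' is the id used by older Gym releases. Newer Gym/Gymnasium
# releases register the environment as 'FrozenLake-v1', and there the transition
# table may have to be read from env.unwrapped.P instead of env.P.
env = gym.make('FrozenLake-v0')
# The 4x4 grid has 16 cells (states), numbered 0-15, so eon = 16
eon = env.observation_space.n
# There are 4 actions (left, down, right, up), numbered 0-3, so ean = 4
ean = env.action_space.n

# Compute the value function of a given policy (policy evaluation)
def compute_value_function(policy, gamma=1.0):
    # Initialize the V table
    value_table = np.zeros(eon)
    # Convergence threshold
    threshold = 1e-10
    # Loop until convergence
    while True:
        # Copy the current V table; this copy holds the values from the previous sweep
        updated_value_table = np.copy(value_table)
        # For every state, take the action prescribed by the policy and update its value
        for state in range(eon):
            # Action chosen by the policy (cast to int because the policy array stores floats)
            action = int(policy[state])
            # Update the V value of this state: sum of p * (r + gamma * V(s'))
            value_table[state] = sum([trans_prob*(reward+gamma*updated_value_table[next_state])
                                      for trans_prob, next_state, reward, done in env.P[state][action]])
        # Convergence check: total absolute change over all states
        if (np.sum((np.fabs(updated_value_table-value_table))) <= threshold):
            break
    # Return the V table
    return value_table

# Policy extraction (same as in Section 3.2)
def extract_policy(value_table, gamma=1.0):
    # Array that stores the policy (one action per state)
    policy = np.zeros(eon)
    # For every state, build its Q table by computing the Q-value of each action
    for state in range(eon):
        # Initialize the Q table of this state
        Q_table = np.zeros(ean)
        # Loop over the actions
        for action in range(ean):
            # Same computation as above
            for next_sr in env.P[state][action]:
                trans_prob, next_state, reward, done = next_sr
                # Accumulate the Q-value of this action (the 4 actions are numbered 0-3)
                Q_table[action] += (trans_prob *
                                    (reward+gamma*value_table[next_state]))
        # In the current state, pick the action with the largest Q-value
        policy[state] = np.argmax(Q_table)
    # Return the policy
    return policy

# Policy iteration
def policy_iteration(env, gamma=1.0):
    # Initial policy; despite the variable name, the line below sets every
    # entry to 0 (always move left)
    random_policy = np.zeros(eon)
    # Maximum number of iterations
    no_of_iterations = 200000
    # Start iterating
    for i in range(no_of_iterations):
        # Evaluate the current policy
        new_value_function = compute_value_function(random_policy, gamma)
        # Derive the improved (greedy) policy
        new_policy = extract_policy(new_value_function, gamma)
        # Stop when the policy no longer changes
        if (np.all(random_policy == new_policy)):
            print('Policy iteration converged at step %d.' % (i+1))
            break
        # The new policy becomes the policy to evaluate in the next iteration
        random_policy = new_policy
    # Return the final policy
    return new_policy

# Print the optimal policy
print(policy_iteration(env))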
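The printed policy is an array of action numbers, which is hard to read against the grid. As a convenience, the sketch below draws a policy as arrows on the 4x4 map, using the action encoding from Section 3.1 (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3). The helper render_policy and the arrow symbols are made up for this example, and the all-left policy shown is only a placeholder, not the optimal policy.

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
# Action number -> arrow symbol (LEFT, DOWN, RIGHT, UP)
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}


def render_policy(policy, lake_map=LAKE_MAP):
    """Return the grid as text, with an arrow on every non-terminal cell."""
    ncol = len(lake_map[0])
    lines = []
    for r, row in enumerate(lake_map):
        symbols = []
        for c, cell in enumerate(row):
            if cell in "HG":
                # Holes and the goal are terminal: show the map symbol instead
                symbols.append(cell)
            else:
                # int() also accepts the float entries produced by np.zeros
                symbols.append(ARROWS[int(policy[r * ncol + c])])
        lines.append("".join(symbols))
    return "\n".join(lines)


example_policy = [0.0] * 16  # placeholder policy: always move left
print(render_policy(example_policy))
```

Passing the array returned by policy_iteration(env) or extract_policy(...) instead of example_policy should work the same way, since both contain one action number per state.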
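Tips:
Changing
random_policy = np.zeros(eon)
to
random_policy = np.ones(eon)
sets the initial policy to all ones (always move down), which converges faster here! The initial policy can also be drawn at random from the actions 0-3. Note that the upper bound of np.random.randint is exclusive, so it has to be 4 for action 3 to be included (the + 0.0 converts the integers to floats, matching np.zeros):
random_policy = np.random.randint(0, 4, eon) + 0.0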
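To see why the initial policy can change the number of rounds, the sketch below runs policy iteration from two different starting policies on a hypothetical three-state chain (not FrozenLake). Everything here is an assumption made for illustration: the table TOY_P, the helper names, and gamma = 0.9. The round counts say nothing about FrozenLake itself.

```python
# A made-up deterministic chain: 0 -> 1 -> 2, where state 2 is terminal.
# Action 0 moves left (or stays at the left end), action 1 moves right.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def evaluate(P, policy, gamma, threshold=1e-10):
    """Iterative policy evaluation with synchronous sweeps."""
    values = [0.0] * len(P)
    while True:
        old = list(values)
        for s in P:
            values[s] = sum(p * (r + gamma * old[s2])
                            for p, s2, r, _ in P[s][policy[s]])
        if sum(abs(v - o) for v, o in zip(values, old)) <= threshold:
            return values


def improve(P, values, gamma):
    """Greedy policy with respect to the given values (first maximum wins ties)."""
    policy = []
    for s in sorted(P):
        q = [sum(p * (r + gamma * values[s2]) for p, s2, r, _ in P[s][a])
             for a in sorted(P[s])]
        policy.append(q.index(max(q)))
    return policy


def toy_policy_iteration(P, initial_policy, gamma=0.9, max_rounds=100):
    """Return the final policy and the round in which it stopped changing."""
    policy = list(initial_policy)
    for round_number in range(1, max_rounds + 1):
        new_policy = improve(P, evaluate(P, policy, gamma), gamma)
        if new_policy == policy:
            return new_policy, round_number
        policy = new_policy
    return policy, max_rounds


policy_a, rounds_a = toy_policy_iteration(TOY_P, [0, 0, 0])  # start: always left
policy_b, rounds_b = toy_policy_iteration(TOY_P, [1, 1, 1])  # start: always right
print(policy_a, rounds_a)
print(policy_b, rounds_b)
```

Both runs end with the same policy, but the run that starts closer to it needs fewer rounds. The same kind of comparison can be made on FrozenLake by passing different initial arrays into policy_iteration.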
Comments and corrections are welcome!