Preface
Personal study notes. Please do not repost; corrections are very welcome.
I. epsilon-greedy
1. What is epsilon-greedy?
Based on Chapter 2 of Reinforcement Learning: An Introduction: "greedy" simply means picking the action with the highest estimated value. Under epsilon-greedy, the greedy action keeps the largest probability, at least 1 - epsilon, while the remaining epsilon is spread over all actions, so every action retains a chance of being tried (the greedy action itself ends up with probability 1 - epsilon + epsilon/|A|, each other action with epsilon/|A|).
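As a quick numeric illustration, here is a minimal sketch of that distribution. The helper epsilon_greedy_probs is hypothetical and not part of the demo code quoted later:

import numpy as np

def epsilon_greedy_probs(q_values, eps=0.1):
    # q_values: estimated action values for one state
    # returns the epsilon-greedy selection probability of each action
    n = len(q_values)
    probs = np.full(n, eps / n)             # every action gets eps/|A|
    probs[np.argmax(q_values)] += 1 - eps   # the greedy action gets the rest
    return probs

# 4 actions, the third one is greedy:
print(epsilon_greedy_probs(np.array([0.0, 0.1, 0.5, -0.2]), eps=0.2))
# -> [0.05 0.05 0.85 0.05]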
2. Why use epsilon-greedy in MC? Because MC-ES (exploring starts) has two drawbacks:
(1) Every episode has to be started from a different (s, a) pair so that all pairs get visited, which is very inefficient. With epsilon-greedy we can start every episode from a single state; as long as enough actions are taken, all (s, a) pairs will eventually be reached, which improves efficiency (see the coverage sketch after this list).
(2) As discussed at the end of the previous note, if the actions at state (2,3) are distributed too randomly, then because that cell is blocked in on three sides its state value approaches -1 as the number of iterations grows. epsilon-greedy changes this.
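A standalone toy sketch of point (1): even when every episode starts from the same cell, an epsilon-soft policy still ends up touching far more (s, a) pairs than a purely greedy one. The 3x3 grid, the terminal cell, and the fixed 'R' policy here are made up for illustration and have nothing to do with the gridWorldGame module used later:

import numpy as np

ACTIONS = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def count_visited_pairs(n_episodes=200, eps=0.1, size=3, max_steps=30):
    # toy deterministic grid; terminal at (0, 2); the base policy always says 'R'
    seen = set()
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s = (2, 0)                                   # every episode starts here
        for _ in range(max_steps):
            if rng.random() < 1 - eps:
                a = 'R'                              # greedy (fixed) action
            else:
                a = rng.choice(list(ACTIONS))        # exploratory action
            seen.add((s, a))
            dr, dc = ACTIONS[a]
            s = (min(max(s[0] + dr, 0), size - 1),
                 min(max(s[1] + dc, 0), size - 1))   # moving into a wall keeps you in place
            if s == (0, 2):                          # terminal state reached
                break
    return len(seen)

print(count_visited_pairs(eps=0.0))   # greedy only: just 3 (s, a) pairs are ever visited
print(count_visited_pairs(eps=0.3))   # epsilon-soft: many more (s, a) pairs get visited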
II. Code
1. Pseudocode
Compared with MC-ES, only the episode-generation condition (no exploring starts are needed) and the final policy improvement step differ.
2. Code
import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid, print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

def random_action(a, eps=0.9):
    # epsilon-greedy action selection:
    # keep the given action a with probability 1 - eps,
    # otherwise pick uniformly among all actions (so a is chosen with 1 - eps + eps/4 in total)
    p = np.random.random()
    if p < (1 - eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)

def max_dict(d):
    # returns the argmax (key) and max (value) from a dictionary
    max_key = None
    max_val = float('-inf')
    for k, v in d.items():
        if v > max_val:
            max_val = v
            max_key = k
    return max_key, max_val

def play_game(grid, policy):
    # returns a list of states and corresponding returns
    # use an epsilon-soft policy
    s = (2, 0)
    grid.set_state(s)
    a = random_action(policy[s])  # the policy's action is kept with probability 1 - eps; the other actions can also be executed

    # each triple is s(t), a(t), r(t)
    # but r(t) results from taking action a(t-1) from s(t-1) and landing in s(t)
    states_actions_rewards = [(s, a, 0)]
    while True:
        r = grid.move(a)
        s = grid.current_state()
        if grid.game_over():  # the episode simply ends at a terminal state (no exploring-starts bookkeeping)
            states_actions_rewards.append((s, None, r))
            break
        else:
            a = random_action(policy[s])  # the next action is chosen epsilon-greedily
            states_actions_rewards.append((s, a, r))

    # calculate the returns by working backwards from the terminal state
    G = 0
    states_actions_returns = []
    first = True
    for s, a, r in reversed(states_actions_rewards):
        # the value of the terminal state is 0 by definition
        # we should ignore the first state we encounter
        # and ignore the last G, which is meaningless since it doesn't correspond to any move
        if first:
            first = False
        else:
            states_actions_returns.append((s, a, G))
        G = r + GAMMA * G
    states_actions_returns.reverse()  # we want it to be in order of states visited
    return states_actions_returns

grid = negative_grid(step_cost=-0.1)

# initialize a random policy
policy = {}
for s in grid.actions.keys():
    policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initialize Q(s,a) and returns
Q = {}  # stores the average return observed so far for each (s, a)
returns = {}  # dictionary of (state, action) -> list of returns we've received
states = grid.all_states()
for s in states:
    if s in grid.actions:  # not a terminal state
        Q[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            Q[s][a] = 0
            returns[(s, a)] = []
    else:
        # terminal state or state we can't otherwise get to
        pass

deltas = []
for t in range(5000):
    # generate an episode using pi
    biggest_change = 0
    states_actions_returns = play_game(grid, policy)

    # calculate Q(s,a)
    seen_state_action_pairs = set()
    for s, a, G in states_actions_returns:
        # check if we have already seen (s, a)
        # called "first-visit" MC policy evaluation
        sa = (s, a)
        if sa not in seen_state_action_pairs:
            old_q = Q[s][a]
            returns[sa].append(G)
            Q[s][a] = np.mean(returns[sa])
            biggest_change = max(biggest_change, np.abs(old_q - Q[s][a]))
            seen_state_action_pairs.add(sa)
    deltas.append(biggest_change)

    # calculate new policy pi(s) = argmax[a]{ Q(s,a) }
    for s in policy.keys():
        a, _ = max_dict(Q[s])
        policy[s] = a

# plot how much Q changed each episode (this is what deltas and the matplotlib import are for)
plt.plot(deltas)
plt.show()

V = {}
for s in policy.keys():
    V[s] = max_dict(Q[s])[1]

print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)
Differences from MC-ES:
(1) s = (2, 0) — there is only a single start state.
(2) a = random_action(policy[s]) — the action is not fixed; it is drawn epsilon-greedily around the current policy (a rough sketch of the exploring-starts version follows below for comparison).
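For contrast, the exploring-starts version of play_game begins with a random (s, a) pair instead of a fixed start state. This is a sketch from memory of the MC-ES demo in the same tutorial series; the exact variable names there may differ:

def play_game_es(grid, policy):
    # exploring starts: pick a random start state and a random first action,
    # so every (s, a) pair has a nonzero probability of starting an episode
    start_states = list(grid.actions.keys())
    s = start_states[np.random.choice(len(start_states))]
    grid.set_state(s)
    a = np.random.choice(ALL_POSSIBLE_ACTIONS)  # the first action is uniformly random
    # ... the rest of the episode then follows the deterministic policy:
    # a = policy[s] instead of a = random_action(policy[s])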
3. Results:
Even with a large epsilon, the results are still good, including at (2,3). One reason is that this example is fairly simple; another is that we never have to start from (2,3): from the moment the agent first steps into (2,3), the low returns tell it not to go in there again, so (2,3) ends up behaving well.
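One way to check this claim is to look at the learned Q-values of a neighbouring cell after training. Assuming the usual layout of this grid-world demo, where moving 'R' from (2,2) leads into (2,3), a hypothetical inspection snippet run after the script above lets you compare the value of the action that enters (2,3) with the alternatives:

# inspect the learned action values at (2, 2); 'R' is the move that enters (2, 3)
for a in ALL_POSSIBLE_ACTIONS:
    print(a, round(Q[(2, 2)][a], 3))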
Summary
By incorporating the epsilon-greedy idea, this method performs better than the previous two MC versions.
Excerpted from: https://www.bilibili.com/video/BV1sd4y167NS?p=22&vd_source=35775f5151031f11ee2a799855c8e368
https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/monte_carlo_epsilon_greedy_demo.ipynb