Reinforcement Learning 5 -- MC epsilon-greedy

Preface

Personal notes; please do not repost. Corrections are very welcome.

I. epsilon-greedy

1. What is epsilon-greedy?
As described in Chapter 2 of Reinforcement Learning: An Introduction, "greedy" here simply means the largest probability: among the available actions, the optimal (greedy) action is chosen with the largest probability, roughly 1 - epsilon, while the remaining probability is shared among the other actions.
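Stated in formulas (my own restatement of the textbook definition, since the screenshot from the original post is not reproduced here), an epsilon-greedy policy over the action set A(s) is:

$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & a = \arg\max_{a'} Q(s, a'), \\
\dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{otherwise.}
\end{cases}
$$

With four actions and, say, epsilon = 0.1, the greedy action is taken with probability 0.925 and each of the other three actions with probability 0.025.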

2. Why use epsilon-greedy in MC? It addresses two drawbacks of MC-ES (a small numerical check follows the list below):
(1) MC-ES requires that every (s, a) pair be covered as an episode start (exploring starts), so all s, a have to be traversed separately, which is very inefficient. With epsilon-greedy we can start every episode from a single state, and as long as enough actions are taken, every (s, a) pair will necessarily be visited, which improves efficiency.
(2) As discussed at the end of the previous note, if the actions at state (2, 3) are spread too randomly, then because that cell is walled in on three sides, its state value approaches -1 as the number of iterations grows. epsilon-greedy changes this situation.
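A minimal numerical check of the action distribution (my own sketch; it reuses the same selection logic as the random_action helper in the code below, and eps = 0.1 is just an illustrative value):

import numpy as np

ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

def epsilon_greedy(greedy_a, eps=0.1):
  # keep the greedy action with probability 1 - eps, otherwise pick uniformly
  # from all four actions (the greedy one included), so the greedy action
  # ends up with total probability 1 - eps + eps/4
  if np.random.random() < 1 - eps:
    return greedy_a
  return np.random.choice(ALL_POSSIBLE_ACTIONS)

N = 100000
counts = {a: 0 for a in ALL_POSSIBLE_ACTIONS}
for _ in range(N):
  counts[epsilon_greedy('U')] += 1
print({a: round(c / N, 3) for a, c in counts.items()})
# roughly {'U': 0.925, 'D': 0.025, 'L': 0.025, 'R': 0.025}: every action keeps a
# nonzero probability, so every reachable (s, a) pair will eventually be visited
# without exploring starts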

II. Code

1. Pseudocode

It differs from MC-ES only in the episode-generation (traversal) condition and in the final policy improvement; a sketch of the loop structure follows below.
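The pseudocode image from the original post is not reproduced here, so the following is my own sketch of the loop, reconstructed from the code in the next subsection (it matches on-policy first-visit MC control for epsilon-soft policies as given in Sutton & Barto):

initialize Q(s, a) = 0 and Returns(s, a) = [] for all s, a; start from an arbitrary policy pi
repeat for each episode:
  generate an episode from a single fixed start state, choosing actions epsilon-greedily around pi
  G = 0
  for each step (S_t, A_t, R_{t+1}) of the episode, from the last step backwards:
    G = R_{t+1} + gamma * G
    if (S_t, A_t) does not appear earlier in the episode (first visit):
      append G to Returns(S_t, A_t)
      Q(S_t, A_t) = average(Returns(S_t, A_t))
  for every state s: pi(s) = argmax_a Q(s, a)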

2. Code

import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid,print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

def random_action(a, eps=0.9): # epsilon-greedy action selection (note the deliberately large eps; see the Results section)
  # choose given a with probability 1 - eps + eps/4
  p = np.random.random()
  if p < (1 - eps):
    return a
  else:
    return np.random.choice(ALL_POSSIBLE_ACTIONS)

def max_dict(d):
  # returns the argmax (key) and max (value) from a dictionary
  max_key = None
  max_val = float('-inf')
  for k, v in d.items():
    if v > max_val:
      max_val = v
      max_key = k
  return max_key, max_val

def play_game(grid, policy):
  # returns a list of states and corresponding returns
  # use an epsilon-soft policy
  s = (2, 0)
  grid.set_state(s)
  a = random_action(policy[s]) # keep the policy's action with probability 1 - eps + eps/4; otherwise one of the other three actions may be taken

  # each triple is s(t), a(t), r(t)
  # but r(t) results from taking action a(t-1) from s(t-1) and landing in s(t)
  states_actions_rewards = [(s, a, 0)]
  while True:
    r = grid.move(a)
    s = grid.current_state()
    if grid.game_over(): # no exploring-starts condition is needed here, unlike MC-ES
      states_actions_rewards.append((s, None, r))
      break
    else:
      a = random_action(policy[s]) # the next state is stochastic
      states_actions_rewards.append((s, a, r))

  # calculate the returns by working backwards from the terminal state
  G = 0
  states_actions_returns = []
  first = True
  for s, a, r in reversed(states_actions_rewards):
    # the value of the terminal state is 0 by definition
    # we should ignore the first state we encounter
    # and ignore the last G, which is meaningless since it doesn't correspond to any move
    if first:
      first = False
    else:
      states_actions_returns.append((s, a, G))
    G = r + GAMMA*G
  states_actions_returns.reverse() # we want it to be in order of state visited
  return states_actions_returns

grid = negative_grid(step_cost=-0.1) # gridworld with a small penalty (-0.1) on every step
policy = {}
for s in grid.actions.keys():
  policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)
# initialize Q(s,a) and returns
Q = {} # stores the running average of the returns for each (s, a)
returns = {} # dictionary of (state, action) -> list of returns we've received
states = grid.all_states()
for s in states:
  if s in grid.actions: # not a terminal state
    Q[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
      Q[s][a] = 0
      returns[(s,a)] = []
  else:
    # terminal state or state we can't otherwise get to
    pass


deltas = []
for t in range(5000):
  # generate an episode using pi
  biggest_change = 0
  states_actions_returns = play_game(grid, policy)

  # calculate Q(s,a)
  seen_state_action_pairs = set()
  for s, a, G in states_actions_returns:
    # check if we have already seen (s, a) in this episode
    # called "first-visit" MC policy evaluation
    sa = (s, a)
    if sa not in seen_state_action_pairs:
      old_q = Q[s][a]
      returns[sa].append(G)
      Q[s][a] = np.mean(returns[sa])
      biggest_change = max(biggest_change, np.abs(old_q - Q[s][a]))
      seen_state_action_pairs.add(sa)
  deltas.append(biggest_change)

  # calculate new policy pi(s) = argmax[a]{ Q(s,a) }
  for s in policy.keys():
    a, _ = max_dict(Q[s])
    policy[s] = a
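
# optional addition (not in the original snippet): plot the per-episode maximum
# change in Q to monitor convergence; this is also what the matplotlib import is for
plt.plot(deltas)
plt.show()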

V = {} # state values for printing: V(s) = max_a Q(s, a)
for s in policy.keys():
  V[s] = max_dict(Q[s])[1]

print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)

Differences from MC-ES:
(1) s = (2, 0): there is only a single start state.
(2) a = random_action(policy[s]): the action is not fixed; it is chosen epsilon-greedily around the current policy.

3. Results:


(Screenshot of the program output in the original post.)
Even if epsilon is increased, the result is still good, including at (2, 3). One reason is that this example is fairly simple; another is that we do not need to start from (2, 3): from the moment an episode enters (2, 3), the system already tells us not to go in, so (2, 3) behaves well.

Summary

By combining the epsilon-greedy algorithm, this method performs better than the previous two MC methods.

Excerpted from: https://www.bilibili.com/video/BV1sd4y167NS?p=22&vd_source=35775f5151031f11ee2a799855c8e368
https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/monte_carlo_epsilon_greedy_demo.ipynb
