Preface
Personal study notes. Please do not repost; corrections are very welcome.
I. epsilon-greedy
1. What is epsilon-greedy?
Based on Chapter 2 of Reinforcement Learning: An Introduction: "greedy" simply means picking the action with the highest estimated value. Under epsilon-greedy, the greedy action keeps the largest probability, at least 1 - epsilon, while the remaining epsilon is spread over all actions, so every action retains a chance of being tried (the greedy action itself ends up with probability 1 - epsilon + epsilon/|A|, each other action with epsilon/|A|).
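As a quick numeric illustration, here is a minimal sketch of that distribution. The helper epsilon_greedy_probs is hypothetical and not part of the demo code quoted later:

import numpy as np

def epsilon_greedy_probs(q_values, eps=0.1):
    # q_values: estimated action values for one state
    # returns the epsilon-greedy selection probability of each action
    n = len(q_values)
    probs = np.full(n, eps / n)             # every action gets eps/|A|
    probs[np.argmax(q_values)] += 1 - eps   # the greedy action gets the rest
    return probs

# 4 actions, the third one is greedy:
print(epsilon_greedy_probs(np.array([0.0, 0.1, 0.5, -0.2]), eps=0.2))
# -> [0.05 0.05 0.85 0.05]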
2. Why use epsilon-greedy in MC? Because MC-ES (exploring starts) has two drawbacks:
(1) Every episode has to be started from a different (s, a) pair so that all pairs get visited, which is very inefficient. With epsilon-greedy we can start every episode from a single state; as long as enough actions are taken, all (s, a) pairs will eventually be reached, which improves efficiency (see the coverage sketch after this list).
(2) As discussed at the end of the previous note, if the actions at state (2,3) are distributed too randomly, then because that cell is blocked in on three sides its state value approaches -1 as the number of iterations grows. epsilon-greedy changes this.
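A standalone toy sketch of point (1): even when every episode starts from the same cell, an epsilon-soft policy still ends up touching far more (s, a) pairs than a purely greedy one. The 3x3 grid, the terminal cell, and the fixed 'R' policy here are made up for illustration and have nothing to do with the gridWorldGame module used later:

import numpy as np

ACTIONS = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def count_visited_pairs(n_episodes=200, eps=0.1, size=3, max_steps=30):
    # toy deterministic grid; terminal at (0, 2); the base policy always says 'R'
    seen = set()
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s = (2, 0)                                   # every episode starts here
        for _ in range(max_steps):
            if rng.random() < 1 - eps:
                a = 'R'                              # greedy (fixed) action
            else:
                a = rng.choice(list(ACTIONS))        # exploratory action
            seen.add((s, a))
            dr, dc = ACTIONS[a]
            s = (min(max(s[0] + dr, 0), size - 1),
                 min(max(s[1] + dc, 0), size - 1))   # moving into a wall keeps you in place
            if s == (0, 2):                          # terminal state reached
                break
    return len(seen)

print(count_visited_pairs(eps=0.0))   # greedy only: just 3 (s, a) pairs are ever visited
print(count_visited_pairs(eps=0.3))   # epsilon-soft: many more (s, a) pairs get visited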
II. Code
1. Pseudocode
Compared with MC-ES, only the episode-generation condition (no exploring starts are needed) and the final policy improvement step differ.
2. Code
import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid, print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

def random_action(a, eps=0.9):
    # epsilon-greedy action selection:
    # keep the given action a with probability 1 - eps,
    # otherwise pick uniformly among all actions (so a is chosen with 1 - eps + eps/4 in total)
    p = np.random.random()
    if p < (1 - eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)

def max_dict(d):
    # returns the argmax (key) and max (value) from a dictionary
    max_key = None
    max_val = float('-inf')
    for k, v in d.items():
        if v > max_val:
            max_val = v
            max_key = k
    return max_key, max_val

def play_game(grid, policy):
    # returns a list of states and corresponding returns
    # use an epsilon-soft policy
    s = (2, 0)
    grid.set_state(s)
    a = random_action(policy[s])  # the policy's action is kept with probability 1 - eps; the other actions can also be executed

    # each triple is s(t), a(t), r(t)
    # but r(t) results from taking action a(t-1) from s(t-1) and landing in s(t)
    states_actions_rewards = [(s, a, 0)]
    while True:
        r = grid.move(a)
        s = grid.current_state()
        if grid.game_over():  # the episode simply ends at a terminal state (no exploring-starts bookkeeping)
            states_actions_rewards.append((s, None, r))
            break
        else:
            a = random_action(policy[s])  # the next action is chosen epsilon-greedily
            states_actions_rewards.append((s, a, r))

    # calculate the returns by working backwards from the terminal state
    G = 0
    states_actions_returns = []
    first = True
    for s, a, r in reversed(states_actions_rewards):
        # the value of the terminal state is 0 by definition
        # we should ignore the first state we encounter
        # and ignore the last G, which is meaningless since it doesn't correspond to any move
        if first:
            first = False
        else:
            states_actions_returns.append((s, a, G))
        G = r + GAMMA * G
    states_actions_returns.reverse()  # we want it to be in order of states visited
    return states_actions_returns

grid = negative_grid(step_cost=-0.1)

# initialize a random policy
policy = {}
for s in grid.actions.keys():
    policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initialize Q(s,a) and returns
Q = {}  # stores the average return observed so far for each (s, a)
returns = {}  # dictionary of (state, action) -> list of returns we've received
states = grid.all_states()
for s in states:
    if s in grid.actions:  # not a terminal state
        Q[s] = {}
        for a in ALL_POSSIBLE_ACTIONS:
            Q[s][a] = 0
            returns[(s, a)] = []
    else:
        # terminal state or state we can't otherwise get to
        pass

deltas = []
for t in range(5000):
    # generate an episode using pi
    biggest_change = 0
    states_actions_returns = play_game(grid, policy)

    # calculate Q(s,a)
    seen_state_action_pairs = set()
    for s, a, G in states_actions_returns:
        # check if we have already seen (s, a)
        # called "first-visit" MC policy evaluation
        sa = (s, a)
        if sa not in seen_state_action_pairs:
            old_q = Q[s][a]
            returns[sa].append(G)
            Q[s][a] = np.mean(returns[sa])
            biggest_change = max(biggest_change, np.abs(old_q - Q[s][a]))
            seen_state_action_pairs.add(sa)
    deltas.append(biggest_change)

    # calculate new policy pi(s) = argmax[a]{ Q(s,a) }
    for s in policy.keys():
        a, _ = max_dict(Q[s])
        policy[s] = a

# plot how much Q changed each episode (this is what deltas and the matplotlib import are for)
plt.plot(deltas)
plt.show()

V = {}
for s in policy.keys():
    V[s] = max_dict(Q[s])[1]

print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)
Differences from MC-ES:
(1) s = (2, 0) — there is only a single start state.
(2) a = random_action(policy[s]) — the action is not fixed; it is drawn epsilon-greedily around the current policy (a rough sketch of the exploring-starts version follows below for comparison).
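For contrast, the exploring-starts version of play_game begins with a random (s, a) pair instead of a fixed start state. This is a sketch from memory of the MC-ES demo in the same tutorial series; the exact variable names there may differ:

def play_game_es(grid, policy):
    # exploring starts: pick a random start state and a random first action,
    # so every (s, a) pair has a nonzero probability of starting an episode
    start_states = list(grid.actions.keys())
    s = start_states[np.random.choice(len(start_states))]
    grid.set_state(s)
    a = np.random.choice(ALL_POSSIBLE_ACTIONS)  # the first action is uniformly random
    # ... the rest of the episode then follows the deterministic policy:
    # a = policy[s] instead of a = random_action(policy[s])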
3. Results:
Even with a large epsilon, the results are still good, including at (2,3). One reason is that this example is fairly simple; another is that we never have to start from (2,3): from the moment the agent first steps into (2,3), the low returns tell it not to go in there again, so (2,3) ends up behaving well.
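One way to check this claim is to look at the learned Q-values of a neighbouring cell after training. Assuming the usual layout of this grid-world demo, where moving 'R' from (2,2) leads into (2,3), a hypothetical inspection snippet run after the script above lets you compare the value of the action that enters (2,3) with the alternatives:

# inspect the learned action values at (2, 2); 'R' is the move that enters (2, 3)
for a in ALL_POSSIBLE_ACTIONS:
    print(a, round(Q[(2, 2)][a], 3))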
Summary
By incorporating the epsilon-greedy idea, this method performs better than the previous two MC versions.
Excerpted from: https://www.bilibili.com/video/BV1sd4y167NS?p=22&vd_source=35775f5151031f11ee2a799855c8e368
https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/monte_carlo_epsilon_greedy_demo.ipynb