Preface
These are personal notes. Corrections are welcome; please do not repost.
一、On policy & off policy
First, we need to be clear about the difference between on-policy and off-policy. Why?
Because this is exactly the difference between Q-learning and Sarsa: Q-learning is off-policy, while Sarsa is on-policy.
1. Definition
When the behavior policy is the same as the target policy, the learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.
Tip: the behavior policy is the policy that interacts with the environment to generate data; the target policy is the policy being learned and improved, and it does not interact with the environment.
The concrete difference lies in how a_{t+1} is chosen:
In Sarsa, a_{t+1}, like a_t, is chosen by π_b, the behavior policy, which in Sarsa is the ε-greedy action-selection policy. On-policy means both a_t and a_{t+1} are chosen by this same policy.
In Q-learning, only a_t is chosen by π_b; a_{t+1} is determined by the target policy. What is the target policy here? Read on.
After the definitions, jump to the code section first and read through the code once.
2. Code comparison
The key difference is how a2 is selected.
Sarsa:
r = grid.move(a)
s2 = grid.current_state()
# we need the next action as well since Q(s,a) depends on Q(s',a')
# if s2 not in policy then it's a terminal state, all Q are 0
a2 = max_dict(Q[s2])[0]
a2 = random_action(a2, eps=0.5/t) # epsilon-greedy
Clearly, a2 is chosen by the previously defined random_action, i.e., the ε-greedy behavior policy.
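The excerpt stops before the update itself. Below is a minimal self-contained sketch of the Sarsa update that uses this sampled a2 (the states, Q-values, and reward are hypothetical, chosen only for illustration; alpha and GAMMA match the constants in the full script further down):

```python
GAMMA = 0.9   # discount factor, as in the full script
alpha = 0.1   # learning rate

# hypothetical Q-table for illustration
Q = {
    (2, 0): {'U': 0.0, 'D': 0.0},
    (1, 0): {'U': 1.0, 'D': 3.0},
}

s, a = (2, 0), 'U'
r = -0.1                 # step reward
s2, a2 = (1, 0), 'U'     # a2 sampled by the epsilon-greedy behavior policy

# Sarsa (on-policy) update: the TD target uses Q[s2][a2] for the
# sampled action a2, not the max over actions at s2
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * Q[s2][a2] - Q[s][a])
print(Q[s][a])
```

Note that the target here would have been different had the behavior policy sampled 'D' instead, which is precisely what makes Sarsa on-policy.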
Q-learning:
a2, max_q_s2a2 = max_dict(Q[s2])
Q[s][a] = Q[s][a] + alpha*(r + GAMMA*max_q_s2a2 - Q[s][a])
biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))
Here a2 is determined by the largest action value currently available at s2. Now think about what the target policy is!
What is the target? The maximum action value. So the target policy is the greedy policy: the update uses the action with the current maximum action value at s2.
3. Summary:
(1) The simplest characterization of off-policy: the learning is from data off the target policy.
(2) Q-learning solves the Bellman optimality equation, so its target policy needs no exploration and always picks the greedy (optimal) action; exploration is still handled by the behavior policy that collects the data.
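The contrast can be made concrete on a single update step. In this sketch the action values at the next state s2 are hypothetical; the only difference between the two algorithms is how the TD target is constructed:

```python
GAMMA = 0.9
r = -0.1

# hypothetical action values at the next state s2
Q_s2 = {'U': 1.0, 'D': 3.0, 'L': 0.5, 'R': 2.0}

# Sarsa (on-policy): the target uses whatever action the epsilon-greedy
# behavior policy actually sampled; suppose it explored and drew 'U'
a2_sampled = 'U'
sarsa_target = r + GAMMA * Q_s2[a2_sampled]

# Q-learning (off-policy): the target policy is greedy with respect to Q,
# so the target always uses the maximum action value at s2
qlearning_target = r + GAMMA * max(Q_s2.values())

print(sarsa_target)      # follows the behavior policy's sample
print(qlearning_target)  # follows the greedy target policy
```

Because Q-learning's target ignores the action the behavior policy will actually take next, it can learn the optimal policy from exploratory (even random) behavior data.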
二、Code
import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid,print_values, print_policy
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
ALPHA = 0.1
def max_dict(d):
    # returns the argmax (key) and max (value) from a dictionary
    max_key = None
    max_val = float('-inf')
    for k, v in d.items():
        if v > max_val:
            max_val = v
            max_key = k
    return max_key, max_val
def random_action(a, eps=0.1):
    # epsilon-soft to ensure all states are visited
    p = np.random.random()
    if p < (1 - eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)
grid = negative_grid(step_cost=-0.1)
Q = {}
states = grid.all_states()
for s in states:
    Q[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
        Q[s][a] = 0
update_counts = {}
update_counts_sa = {}
for s in states:
    update_counts_sa[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
        update_counts_sa[s][a] = 1.0
t = 1.0
deltas = []
for it in range(10000):
    if it % 100 == 0:
        t += 1e-2
    if it % 2000 == 0:
        print("iteration:", it)
    s = (2, 0)  # start state
    grid.set_state(s)
    a, _ = max_dict(Q[s])
    biggest_change = 0
    while not grid.game_over():
        a = random_action(a, eps=0.5 / t)  # epsilon-greedy
        r = grid.move(a)
        s2 = grid.current_state()
        alpha = ALPHA / update_counts_sa[s][a]
        update_counts_sa[s][a] += 0.005
        old_qsa = Q[s][a]
        a2, max_q_s2a2 = max_dict(Q[s2])
        Q[s][a] = Q[s][a] + alpha * (r + GAMMA * max_q_s2a2 - Q[s][a])
        biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))
        update_counts[s] = update_counts.get(s, 0) + 1
        s = s2
        a = a2
    deltas.append(biggest_change)
policy = {}
V = {}
for s in grid.actions.keys():
    a, max_q = max_dict(Q[s])
    policy[s] = a
    V[s] = max_q
print("update counts:")
total = np.sum(list(update_counts.values()))
for k, v in update_counts.items():
    update_counts[k] = float(v) / total
print_values(update_counts, grid)
print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)
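The script imports matplotlib but never uses it; a natural use is plotting deltas to check convergence. A sketch (with a synthetic stand-in deltas list so it runs on its own; in the script itself, use the list filled by the training loop):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# stand-in for the deltas list built by the training loop above:
# the largest absolute Q-value change in each episode
deltas = list(np.exp(-np.linspace(0, 5, 100)))

# a curve decaying toward zero suggests the Q-values have converged
plt.plot(deltas)
plt.xlabel("episode")
plt.ylabel("biggest Q change")
plt.savefig("deltas.png")
```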
Summary
Below is the table compiled by Prof. Zhao; many thanks for his contribution.
Excerpted from: https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/q_learning_demo.ipynb
https://www.bilibili.com/video/BV1sd4y167NS/