Reinforcement Learning 8: Q-learning

Preface


These are personal notes; corrections are very welcome. Please do not repost.

I. On-policy & off-policy

First, we need to be clear about the difference between on-policy and off-policy. Why?
Because this is exactly the difference between Q-learning and Sarsa: Q-learning is off-policy, while Sarsa is on-policy.
1. Definition
When the behavior policy is the same as the target policy, such a learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.

Tip: the behavior policy is the policy that interacts with the environment; the target policy is the policy we actually want to learn, and it does not interact with the environment.
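To make the two roles concrete, here is a minimal sketch (my own illustration; the function names are made up and not part of the original post): the behavior policy is ε-greedy over the current action values and actually interacts with the environment, while the target policy is purely greedy and is only used to form the learning target.

    import random

    def behavior_policy(Q_s, actions, eps=0.1):
        # interacts with the environment: mostly greedy, occasionally random (exploration)
        if random.random() < eps:
            return random.choice(actions)
        return max(Q_s, key=Q_s.get)

    def target_policy(Q_s):
        # never executed directly: always greedy w.r.t. the current action values
        return max(Q_s, key=Q_s.get)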
The concrete difference lies in how a_{t+1} is chosen:
Sarsa update: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
In Sarsa, a_{t+1}, just like a_t, is decided by π_b. Here π_b, the behavior policy, can be understood as the ε-greedy policy that Sarsa uses to select actions. On-policy means that both a_t and a_{t+1} are selected by this same policy.
Q-learning update: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

In Q-learning, only a_t is selected by π_b; a_{t+1} is selected by the target policy. So what is this target policy? Read on.
After reading the definition, first jump down to the code section and read through the code once.

2. Code comparison
The main thing to compare is how a2 is chosen.
Sarsa:

    r = grid.move(a)
    s2 = grid.current_state()

    # we need the next action as well since Q(s,a) depends on Q(s',a')
    # if s2 not in policy then it's a terminal state, all Q are 0
    a2 = max_dict(Q[s2])[0]
    a2 = random_action(a2, eps=0.5/t) # epsilon-greedy

Clearly, a2 is chosen via the random_action defined earlier, i.e., it is an ε-greedy action.
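For comparison, the Sarsa update that follows this excerpt is not shown above; reconstructed, it would plug that same a2 into the TD target:

    # Sarsa: the TD target uses Q(s2, a2) for the a2 actually chosen by the behavior policy
    Q[s][a] = Q[s][a] + alpha*(r + GAMMA*Q[s2][a2] - Q[s][a])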
Q-learning:

    # the TD target uses the current maximum action value at s2,
    # regardless of which action is actually executed next
    a2, max_q_s2a2 = max_dict(Q[s2])
    Q[s][a] = Q[s][a] + alpha*(r + GAMMA*max_q_s2a2 - Q[s][a])
    biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))

Here a2 is determined by the largest action value at s2 that the iteration has produced so far. Now think about what the target policy is!
What is the target? The largest action value. So what is the target policy here? It is the greedy policy: the update always uses the action that currently has the largest action value at s2, no matter which action is actually executed next. (In the full listing below, a2 is also reused as the next a, but it is re-randomized by random_action at the top of the loop, so the executed action is still ε-greedy.)
3. Summary
(1) The simplest explanation of off-policy: the learning is from data off the target policy.
(2) Q-learning solves the Bellman optimality equation (see the sketch below), so its target policy needs no exploration: it can always pick the greedy, currently optimal action, while exploration is handled by the behavior policy.
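For reference, this is the Bellman optimality equation for action values that Q-learning's TD target approximates (standard form, added here for context):

$$ q_*(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} q_*(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \,\right] $$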

II. Code

import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid,print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
ALPHA = 0.1

def max_dict(d):
  # returns the argmax (key) and max (value) from a dictionary
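  # e.g. max_dict({'U': 0.1, 'D': -0.2}) returns ('U', 0.1)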
  max_key = None
  max_val = float('-inf')
  for k, v in d.items():
    if v > max_val:
      max_val = v
      max_key = k
  return max_key, max_val

def random_action(a, eps=0.1):
  # epsilon-soft to ensure all states are visited
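  # with probability 1 - eps keep the given (greedy) action; otherwise pick uniformly
  # from ALL_POSSIBLE_ACTIONS (which may still be the same action)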
  p = np.random.random()
  if p < (1 - eps):
    return a
  else:
    return np.random.choice(ALL_POSSIBLE_ACTIONS)

grid = negative_grid(step_cost=-0.1)
Q = {}
states = grid.all_states()
for s in states:
  Q[s] = {}
  for a in ALL_POSSIBLE_ACTIONS:
    Q[s][a] = 0

update_counts = {}
update_counts_sa = {}
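# update_counts_sa starts at 1.0 for every (s, a) and drives a slowly decaying
# per-pair learning rate: alpha = ALPHA / update_counts_sa[s][a]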
for s in states:
  update_counts_sa[s] = {}
  for a in ALL_POSSIBLE_ACTIONS:
    update_counts_sa[s][a] = 1.0

t = 1.0
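# t grows slowly (every 100 episodes) so the exploration rate eps = 0.5 / t decays over training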
deltas = []
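# track the biggest Q-value change in each episode to monitor convergence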
for it in range(10000):
  if it % 100 == 0:
    t += 1e-2
  if it % 2000 == 0:
    print("iteration:", it)

  s = (2, 0)  # start state
  grid.set_state(s)
  a, _ = max_dict(Q[s])
  biggest_change = 0
  while not grid.game_over():
    a = random_action(a, eps=0.5 / t)  # epsilon-greedy
    r = grid.move(a)
    s2 = grid.current_state()
    alpha = ALPHA / update_counts_sa[s][a]
    update_counts_sa[s][a] += 0.005
    old_qsa = Q[s][a]
    a2, max_q_s2a2 = max_dict(Q[s2])
    Q[s][a] = Q[s][a] + alpha * (r + GAMMA * max_q_s2a2 - Q[s][a])
    biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))
    update_counts[s] = update_counts.get(s, 0) + 1
    s = s2
    a = a2

  deltas.append(biggest_change)
policy = {}
V = {}
for s in grid.actions.keys():
  a, max_q = max_dict(Q[s])
  policy[s] = a
  V[s] = max_q
print("update counts:")
total = np.sum(list(update_counts.values()))
for k, v in update_counts.items():
  update_counts[k] = float(v) / total
print_values(update_counts, grid)
print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)
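The listing above imports matplotlib but never uses it, and the deltas list is collected without being displayed. A quick convergence check could be appended at the end (my addition, not part of the original script):

plt.plot(deltas)
plt.title("biggest Q-value change per episode")
plt.show()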

Summary

Below is the summary table compiled by Prof. Zhao; many thanks for his contribution. (The table is an image in the original post and is not reproduced here.)
Excerpted from: https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/q_learning_demo.ipynb
https://www.bilibili.com/video/BV1sd4y167NS/
