Preface
These are personal notes. Corrections are welcome; please do not repost.
一、On policy & off policy
First, we need to be clear about the difference between on-policy and off-policy. Why?
Because this is exactly the difference between Q-learning and Sarsa: Q-learning is off-policy, while Sarsa is on-policy.
1. Definition
When the behavior policy is the same as the target policy, the learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.
Tip: the behavior policy is the policy that interacts with the environment to generate data; the target policy is the policy being learned and improved, and it does not interact with the environment.
The concrete difference lies in how a_{t+1} is chosen:
In Sarsa, a_{t+1}, like a_t, is chosen by π_b, the behavior policy, which in Sarsa is the ε-greedy action-selection policy. On-policy means both a_t and a_{t+1} are chosen by this same policy.
In Q-learning, only a_t is chosen by π_b; a_{t+1} is determined by the target policy. What is the target policy here? Read on.
After the definitions, jump to the code section first and read through the code once.
2. Code comparison
The key difference is how a2 is selected.
Sarsa:
r = grid.move(a)
s2 = grid.current_state()
# we need the next action as well since Q(s,a) depends on Q(s',a')
# if s2 not in policy then it's a terminal state, all Q are 0
a2 = max_dict(Q[s2])[0]
a2 = random_action(a2, eps=0.5/t) # epsilon-greedy
Clearly, a2 is chosen by the previously defined random_action, i.e., the ε-greedy behavior policy.
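The excerpt stops before the update itself. Below is a minimal self-contained sketch of the Sarsa update that uses this sampled a2 (the states, Q-values, and reward are hypothetical, chosen only for illustration; alpha and GAMMA match the constants in the full script further down):

```python
GAMMA = 0.9   # discount factor, as in the full script
alpha = 0.1   # learning rate

# hypothetical Q-table for illustration
Q = {
    (2, 0): {'U': 0.0, 'D': 0.0},
    (1, 0): {'U': 1.0, 'D': 3.0},
}

s, a = (2, 0), 'U'
r = -0.1                 # step reward
s2, a2 = (1, 0), 'U'     # a2 sampled by the epsilon-greedy behavior policy

# Sarsa (on-policy) update: the TD target uses Q[s2][a2] for the
# sampled action a2, not the max over actions at s2
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * Q[s2][a2] - Q[s][a])
print(Q[s][a])
```

Note that the target here would have been different had the behavior policy sampled 'D' instead, which is precisely what makes Sarsa on-policy.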
Q-learning:
a2, max_q_s2a2 = max_dict(Q[s2])
Q[s][a] = Q[s][a] + alpha*(r + GAMMA*max_q_s2a2 - Q[s][a])
biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))
Here a2 is determined by the largest action value currently available at s2. Now think about what the target policy is!
What is the target? The maximum action value. So the target policy is the greedy policy: the update uses the action with the current maximum action value at s2.
3. Summary:
(1) The simplest characterization of off-policy: the learning is from data off the target policy.
(2) Q-learning solves the Bellman optimality equation, so its target policy needs no exploration and always picks the greedy (optimal) action; exploration is still handled by the behavior policy that collects the data.
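The contrast can be made concrete on a single update step. In this sketch the action values at the next state s2 are hypothetical; the only difference between the two algorithms is how the TD target is constructed:

```python
GAMMA = 0.9
r = -0.1

# hypothetical action values at the next state s2
Q_s2 = {'U': 1.0, 'D': 3.0, 'L': 0.5, 'R': 2.0}

# Sarsa (on-policy): the target uses whatever action the epsilon-greedy
# behavior policy actually sampled; suppose it explored and drew 'U'
a2_sampled = 'U'
sarsa_target = r + GAMMA * Q_s2[a2_sampled]

# Q-learning (off-policy): the target policy is greedy with respect to Q,
# so the target always uses the maximum action value at s2
qlearning_target = r + GAMMA * max(Q_s2.values())

print(sarsa_target)      # follows the behavior policy's sample
print(qlearning_target)  # follows the greedy target policy
```

Because Q-learning's target ignores the action the behavior policy will actually take next, it can learn the optimal policy from exploratory (even random) behavior data.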
二、Code
import numpy as np
import matplotlib.pyplot as plt
from gridWorldGame import standard_grid, negative_grid,print_values, print_policy
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
ALPHA = 0.1
def max_dict(d):
    # returns the argmax (key) and max (value) from a dictionary
    max_key = None
    max_val = float('-inf')
    for k, v in d.items():
        if v > max_val:
            max_val = v
            max_key = k
    return max_key, max_val
def random_action(a, eps=0.1):
    # epsilon-soft to ensure all states are visited
    p = np.random.random()
    if p < (1 - eps):
        return a
    else:
        return np.random.choice(ALL_POSSIBLE_ACTIONS)
grid = negative_grid(step_cost=-0.1)
Q = {}
states = grid.all_states()
for s in states:
    Q[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
        Q[s][a] = 0
update_counts = {}
update_counts_sa = {}
for s in states:
    update_counts_sa[s] = {}
    for a in ALL_POSSIBLE_ACTIONS:
        update_counts_sa[s][a] = 1.0
t = 1.0
deltas = []
for it in range(10000):
    if it % 100 == 0:
        t += 1e-2
    if it % 2000 == 0:
        print("iteration:", it)
    s = (2, 0)  # start state
    grid.set_state(s)
    a, _ = max_dict(Q[s])
    biggest_change = 0
    while not grid.game_over():
        a = random_action(a, eps=0.5 / t)  # epsilon-greedy
        r = grid.move(a)
        s2 = grid.current_state()
        alpha = ALPHA / update_counts_sa[s][a]
        update_counts_sa[s][a] += 0.005
        old_qsa = Q[s][a]
        a2, max_q_s2a2 = max_dict(Q[s2])
        Q[s][a] = Q[s][a] + alpha * (r + GAMMA * max_q_s2a2 - Q[s][a])
        biggest_change = max(biggest_change, np.abs(old_qsa - Q[s][a]))
        update_counts[s] = update_counts.get(s, 0) + 1
        s = s2
        a = a2
    deltas.append(biggest_change)
policy = {}
V = {}
for s in grid.actions.keys():
    a, max_q = max_dict(Q[s])
    policy[s] = a
    V[s] = max_q
print("update counts:")
total = np.sum(list(update_counts.values()))
for k, v in update_counts.items():
    update_counts[k] = float(v) / total
print_values(update_counts, grid)
print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)
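The script imports matplotlib but never uses it; a natural use is plotting deltas to check convergence. A sketch (with a synthetic stand-in deltas list so it runs on its own; in the script itself, use the list filled by the training loop):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# stand-in for the deltas list built by the training loop above:
# the largest absolute Q-value change in each episode
deltas = list(np.exp(-np.linspace(0, 5, 100)))

# a curve decaying toward zero suggests the Q-values have converged
plt.plot(deltas)
plt.xlabel("episode")
plt.ylabel("biggest Q change")
plt.savefig("deltas.png")
```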
Summary
Below is the table compiled by Prof. Zhao; many thanks for his contribution.
Excerpted from: https://github.com/rookiexxj/Reinforcement_learning_tutorial_with_demo/blob/master/q_learning_demo.ipynb
https://www.bilibili.com/video/BV1sd4y167NS/