Reinforcement Learning Series
Preface
Q-Learning is a model-free reinforcement learning algorithm that directly optimizes a Q-function which can be computed iteratively.
Link: https://pan.baidu.com/s/1f-WgZmFaNiD-1Mvgg7lPHQ
Extraction code: k5ok
I. A First Look at Q-Learning
1. A Simulated Scenario
Imagine the following scenario: a child is studying and has two choices. The good behavior is to keep doing homework until it is finished, and then receive a reward. The bad behavior is to go watch TV before the homework is done, and be punished by the parents when caught.
(To fix some terminology: the child is the agent, the two behaviors are action1 and action2, and the two outcomes are result1 and result2.)
What does this have to do with the Q-Learning algorithm?
Suppose this is the first time we are in the "doing homework" state, so we do not yet know the consequences of doing homework versus watching TV. We choose to watch TV, keep watching, and watch some more, and in the end we are punished by the parents for not finishing the task.
With this first experience behind us, we now label "watching TV before finishing homework" as a negative behavior.
After many such experiences, how does Q-Learning make its decisions?
The decision rule is the Q-table: a table that stores, for every state, the Q-value of each available action. The Q-value measures how beneficial an action is.
Suppose our Q-table has already been learned.
We are in state s1 and have two actions a1 and a2. From experience, a2 in state s1 brings a higher potential reward than a1; in terms of the Q-table this means Q(s1, a2) > Q(s1, a1).
So we choose action a2, and the state is updated to s2.
We then repeat the same kind of choice, again picking the action with the larger value in the Q-table. Since Q(s2, a2) is the larger one, we choose a2 and the state is updated to s3.
In this way we keep completing transitions from state to state.
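To make this concrete, here is a minimal sketch of such a Q-table and of greedy action selection over it (the states, actions, and Q-values below are made up for illustration; they are not taken from the scenario above):

import pandas as pd

# A toy Q-table: one row per state, one column per action.
# The numbers are illustrative only.
q_table = pd.DataFrame(
    [[1.0, 2.5],    # Q(s1, a1), Q(s1, a2)
     [0.5, 3.0],    # Q(s2, a1), Q(s2, a2)
     [0.0, 0.0]],   # Q(s3, a1), Q(s3, a2)
    index=['s1', 's2', 's3'],
    columns=['a1', 'a2'],
)

state = 's1'
greedy_action = q_table.loc[state].idxmax()  # 'a2', because Q(s1, a2) > Q(s1, a1)
print(greedy_action)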
Now let us see how the Q-table itself is updated.
When we choose a2 in state s1 and arrive at state s2, we start updating the Q-table that we use for decision making.
We do not actually take any action in s2 yet. Instead, we imagine each possible action in s2 and look at which one has the larger Q-value; the purpose is to estimate the future choice.
For example, if $Q(s_2, a_2) > Q(s_2, a_1)$, we take $\max_a Q(s_2, a) = Q(s_2, a_2)$, multiply it by the discount factor $\gamma = 0.9$, and add the reward $R$ obtained on reaching $s_2$ (since we have not received the lollipop yet, the reward is 0; if we had received the lollipop the reward would be 100, and if we had been punished it would be -10).
Because $R$ is a reward we actually received, we call $R + \gamma \max_a Q(s_2, a)$ the "real" (target) value, while the Q-table already holds our estimate $Q(s_1, a_2)$.
With both the target and the estimate, we can update $Q(s_1, a_2)$: take the difference between them, multiply it by the learning rate $\alpha$, and add it onto the old $Q(s_1, a_2)$ to get the new value.
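Written out, the update just described is the standard tabular Q-learning rule (with $s_1$, $a_2$, $s_2$ taken from the running example):

$$Q(s_1, a_2) \leftarrow Q(s_1, a_2) + \alpha \left[ R + \gamma \max_a Q(s_2, a) - Q(s_1, a_2) \right]$$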
Note: although we used $\max_a Q(s_2, a)$ to estimate the value of the next state $s_2$, we have not yet taken any action in $s_2$. The action in $s_2$ is only chosen after this update is finished, in a fresh decision step.
The formula is explained in more detail below.
This is how off-policy Q-Learning learns, improves, and makes decisions.
2. The Q-Learning Algorithm
The Q-Learning procedure in pseudocode:
Initialize Q(s, a), ∀ s ∈ S, a ∈ A(s), arbitrarily
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]
        S ← S'
    Until S is terminal
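As a minimal, self-contained Python sketch of this loop (the gym-style environment interface assumed here, with reset() returning a state index and step(action) returning (next_state, reward, done), is hypothetical; the Demo in the next section is a concrete version of the same idea):

import numpy as np

def q_learning(env, n_states, n_actions, episodes=100,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning following the pseudocode above.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                              # Initialize S
        done = False
        while not done:                              # for each step of the episode
            if np.random.rand() < epsilon:           # ε-greedy behaviour policy
                a = np.random.randint(n_actions)     # explore
            else:
                a = int(np.argmax(Q[s]))             # exploit
            s_, r, done = env.step(a)                # take A, observe R, S'
            target = r if done else r + gamma * Q[s_].max()
            Q[s, a] += alpha * (target - Q[s, a])    # TD update toward the target
            s = s_                                   # S ← S'
    return Q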
Explanation of some of the parameters:
- $\epsilon$-greedy, with $\epsilon \in (0, 1)$: the strategy used for action selection. For example, when $\epsilon = 0.9$, 90% of the time we choose the action with the best value in the Q-table, and 10% of the time we choose an action at random.
- $\alpha \in (0, 1)$: the learning rate, which determines how much of this step's error is learned.
- $\gamma \in (0, 1)$: the discount factor applied to imagined future rewards.
Substituting recursively into $Q(s_1)$:
$Q(s_1) = r_2 + \gamma Q(s_2) = r_2 + \gamma [r_3 + \gamma Q(s_3)] = r_2 + \gamma [r_3 + \gamma [r_4 + \gamma Q(s_4)]] = \ldots$
$Q(s_1) = r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 r_5 + \gamma^4 r_6 + \ldots$
When $\gamma = 1$: $Q(s_1) = r_2 + r_3 + r_4 + r_5 + r_6 + \ldots$
When $\gamma = 0.6$: $Q(s_1) = r_2 + 0.6\,r_3 + 0.36\,r_4 + 0.216\,r_5 + 0.1296\,r_6 + \ldots$
When $\gamma = 0$: $Q(s_1) = r_2$
From this we can see that the closer $\gamma$ is to 0, the more the agent cares only about immediate rewards; the closer $\gamma$ is to 1, the more weight future rewards receive, so the agent gradually becomes far-sighted.
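A quick numeric check of the expansion above (this small script is illustrative only; the reward sequence $r_2, \ldots, r_6$ is made up):

# Discounted return: r2 + γ·r3 + γ²·r4 + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # r2 .. r6, all equal to 1 for illustration
for gamma in (1.0, 0.6, 0.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 1.0 sums all rewards equally (5.0)
# gamma = 0.6 weights them 1, 0.6, 0.36, 0.216, 0.1296 (≈ 2.3056)
# gamma = 0.0 keeps only the immediate reward r2 (1.0)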
II. Example Code (adapted, not original)
1. Demo
import numpy as np
import pandas as pd
import time

np.random.seed(2)

N_STATE = 6                  # length of the 1-D world 'o----T'
ACTIONS = ['left', 'right']  # available actions
EPSILON = 0.9                # probability of acting greedily (epsilon-greedy)
ALPHA = 0.1                  # learning rate
LAMBDA = 0.9                 # discount factor (gamma)
MAX_EPISODES = 15            # number of training episodes
FRESH_TIME = 0.1             # seconds between screen refreshes

def build_q_table(n_state, actions):
    # all Q-values start at zero
    table = pd.DataFrame(
        np.zeros((n_state, len(actions))),
        columns=actions
    )
    return table

def choose_action(state, q_table):
    state_actions = q_table.iloc[state, :]
    # act randomly with probability 1 - EPSILON, or when the state is still unexplored
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):
        action_name = np.random.choice(ACTIONS)
    else:
        # otherwise pick the action with the largest Q-value
        action_name = state_actions.idxmax()
    return action_name

def get_env_feedback(S, A):
    if A == 'right':  # move right
        if S == N_STATE - 2:
            S_ = 'terminal'  # reached the treasure
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:  # move left
        R = 0
        if S == 0:
            S_ = S  # already at the left wall
        else:
            S_ = S - 1
    return S_, R

def update_env(S, episode, step_counter):
    env_list = ['-'] * (N_STATE - 1) + ['T']  # '-----T' environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode + 1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)

def rl():
    # main part of RL
    Q_table = build_q_table(N_STATE, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, Q_table)
            S_, R = get_env_feedback(S, A)
            q_predict = Q_table.loc[S, A]  # current estimate Q(S, A)
            if S_ != 'terminal':
                q_target = R + LAMBDA * Q_table.iloc[S_, :].max()  # R + gamma * max_a Q(S', a)
            else:
                q_target = R
                is_terminated = True
            Q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # TD update
            S = S_
            update_env(S, episode, step_counter + 1)
            step_counter += 1
    return Q_table

if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ_table:\n')
    print(q_table)
2. Maze_Test
Environment file (environment.py, imported by the agent below):
import time
import numpy as np
import tkinter as tk
from PIL import ImageTk, Image

np.random.seed(1)
PhotoImage = ImageTk.PhotoImage
UNIT = 100    # pixels per grid cell
HEIGHT = 5    # grid height (cells)
WIDTH = 5     # grid width (cells)

class Env(tk.Tk):
    def __init__(self):
        super(Env, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('Q Learning')
        self.geometry('{0}x{1}'.format(WIDTH * UNIT, HEIGHT * UNIT))
        self.shapes = self.load_images()
        self.canvas = self._build_canvas()
        self.texts = []

    def _build_canvas(self):
        canvas = tk.Canvas(self, bg='white',
                           height=HEIGHT * UNIT,
                           width=WIDTH * UNIT)
        # create grids
        for c in range(0, WIDTH * UNIT, UNIT):  # vertical lines every UNIT (100) pixels
            x0, y0, x1, y1 = c, 0, c, HEIGHT * UNIT
            canvas.create_line(x0, y0, x1, y1)
        for r in range(0, HEIGHT * UNIT, UNIT):  # horizontal lines every UNIT (100) pixels
            x0, y0, x1, y1 = 0, r, WIDTH * UNIT, r
            canvas.create_line(x0, y0, x1, y1)
        # add images to canvas
        self.rectangle = canvas.create_image(50, 50, image=self.shapes[0])
        self.triangle1 = canvas.create_image(250, 150, image=self.shapes[1])
        self.triangle2 = canvas.create_image(150, 250, image=self.shapes[1])
        self.circle = canvas.create_image(250, 250, image=self.shapes[2])
        # pack all
        canvas.pack()
        return canvas

    def load_images(self):
        rectangle = PhotoImage(
            Image.open("../img/rectangle.png").resize((65, 65)))
        triangle = PhotoImage(
            Image.open("../img/triangle.png").resize((65, 65)))
        circle = PhotoImage(
            Image.open("../img/circle.png").resize((65, 65)))
        return rectangle, triangle, circle

    def text_value(self, row, col, contents, action, font='Helvetica', size=10,
                   style='normal', anchor="nw"):
        # the offset inside a cell depends on which action the value describes
        if action == 0:
            origin_x, origin_y = 7, 42
        elif action == 1:
            origin_x, origin_y = 85, 42
        elif action == 2:
            origin_x, origin_y = 42, 5
        else:
            origin_x, origin_y = 42, 77
        x, y = origin_y + (UNIT * col), origin_x + (UNIT * row)
        font = (font, str(size), style)
        text = self.canvas.create_text(x, y, fill="black", text=contents,
                                       font=font, anchor=anchor)
        return self.texts.append(text)

    def print_value_all(self, q_table):
        # redraw every Q-value on the grid
        for i in self.texts:
            self.canvas.delete(i)
        self.texts.clear()
        for i in range(HEIGHT):
            for j in range(WIDTH):
                for action in range(0, 4):
                    state = [i, j]
                    if str(state) in q_table.keys():
                        temp = q_table[str(state)][action]
                        self.text_value(j, i, round(temp, 2), action)

    def coords_to_state(self, coords):
        x = int((coords[0] - 50) / 100)
        y = int((coords[1] - 50) / 100)
        return [x, y]

    def state_to_coords(self, state):
        x = int(state[0] * 100 + 50)
        y = int(state[1] * 100 + 50)
        return [x, y]

    def reset(self):
        self.update()
        time.sleep(0.5)
        x, y = self.canvas.coords(self.rectangle)
        # move the agent back to the top-left cell
        self.canvas.move(self.rectangle, UNIT / 2 - x, UNIT / 2 - y)
        self.render()
        # return observation
        return self.coords_to_state(self.canvas.coords(self.rectangle))

    def step(self, action):
        state = self.canvas.coords(self.rectangle)
        base_action = np.array([0, 0])
        self.render()
        if action == 0:  # up
            if state[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:  # down
            if state[1] < (HEIGHT - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:  # left
            if state[0] > UNIT:
                base_action[0] -= UNIT
        elif action == 3:  # right
            if state[0] < (WIDTH - 1) * UNIT:
                base_action[0] += UNIT
        # move the agent
        self.canvas.move(self.rectangle, base_action[0], base_action[1])
        self.canvas.tag_raise(self.rectangle)
        next_state = self.canvas.coords(self.rectangle)
        # reward conditions
        if next_state == self.canvas.coords(self.circle):
            reward = 100
            done = True
        elif next_state in [self.canvas.coords(self.triangle1),
                            self.canvas.coords(self.triangle2)]:
            reward = -100
            done = True
        else:
            reward = 0
            done = False
        next_state = self.coords_to_state(next_state)
        return next_state, reward, done

    # render the environment
    def render(self):
        time.sleep(0.03)
        self.update()
Q-learning agent:
import numpy as np
import random
from environment import Env
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions):
        # actions = [0, 1, 2, 3] : up, down, left, right
        self.actions = actions
        self.learning_rate = 0.01
        self.discount_factor = 0.9
        self.epsilon = 0.1  # exploration probability
        self.q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

    # learn from a sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # Q-learning target from the Bellman optimality equation: r + gamma * max_a Q(s', a)
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

    # choose an action based on the Q-table
    def get_action(self, state):
        if np.random.rand() < self.epsilon:
            # explore: choose a random action (the ε part of ε-greedy)
            action = np.random.choice(self.actions)
        else:
            # exploit: choose the best action from the Q-table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action

    @staticmethod
    def arg_max(state_action):
        # index of the maximum value, breaking ties at random
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))
    for episode in range(1000):
        print(episode)
        state = env.reset()
        while True:
            env.render()
            # the agent chooses an action
            action = agent.get_action(str(state))
            next_state, reward, done = env.step(action)
            # update the Q-table
            agent.learn(str(state), action, reward, str(next_state))
            state = next_state
            env.print_value_all(agent.q_table)
            # end the episode and start a new one when a terminal state is reached
            if done:
                break
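Note that the two examples use opposite naming conventions for ε: in the 1-D Demo, EPSILON = 0.9 is the probability of acting greedily (so roughly 10% of actions are exploratory), whereas in this agent epsilon = 0.1 is the probability of exploring directly. Both are ε-greedy policies that explore about 10% of the time.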
III. Summary
Advantages of the Q-Learning algorithm:
- it needs few parameters;
- it does not require a model of the environment;
- it is not restricted to episodic tasks;
- it can learn off-policy, i.e. from experience generated by another policy;
- it is guaranteed (under the usual conditions) to converge to the optimal action-value function $q_*$.
Disadvantages of the Q-Learning algorithm:
- the max operator in the update introduces maximization bias, so Q-values tend to be overestimated;
- updates may propagate slowly, so learning can be slow;
- its ability to look far ahead can be limited.