(0) Preface
This article introduces reinforcement learning from Q-Learning through Deep Q-Learning and applies the code of each approach to the Flappy Bird mini-game; all references are listed at the end.
(1) Reinforcement Learning
Through continuous interaction with its environment, and under the stimulus of the rewards or penalties the environment gives back, an organism gradually forms expectations about those stimuli and develops the habitual behaviour that yields the greatest benefit.
In general, reinforcement learning involves:
- the set of environment states $\Omega_S$;
- the set of actions $\Omega_A$;
- the agent's policy $\pi$ over environment states;
- the rule (reward function) $\Omega_R$ that specifies the "immediate reward" for each transition;
- the currently observed state $S$;
- the next state $S'$;
- the action $A$ taken in the current state;
- the reward $R$ obtained in the current state.
As shown in the figure above, the agent observes the current state $S$, uses its policy $\pi$ to pick the action $A$ that currently promises the largest return, then observes the resulting reward $R$ and next state $S'$ and uses them to update the policy $\pi$.
The environment is usually assumed to be first-order Markov: the next state depends only on the current state and the action taken in that state.
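In code, this interaction loop is only a few lines. A minimal sketch, assuming hypothetical `env` and `agent` objects with `reset()`/`step()` and `act()`/`update()` methods (illustrative names only, not part of the Flappy Bird code later in this article):

# Generic agent-environment interaction loop (illustrative sketch).
def run_episode(env, agent):
    state = env.reset()                                  # initial state S
    done = False
    while not done:
        action = agent.act(state)                        # pick action A from policy pi
        next_state, reward, done = env.step(action)      # environment returns R and S'
        agent.update(state, action, reward, next_state)  # improve the policy pi
        state = next_state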
For a reinforcement learning algorithm we want:
- the agent to converge within a finite number of steps;
- convergence to the global optimum;
- fast execution, otherwise interacting with the environment becomes painfully slow;
- bounded memory usage;
- convergence that is as fast as possible.
Common reinforcement learning algorithms include:
- Monte-Carlo Learning
- Q-Learning
(2) The Flappy Bird game
This is a classic little game; the algorithm descriptions in this article are based on it.
(3) Q-Learning
The method makes decisions by comparing the Q-values of different actions $A$ in the same state $S$.
Algorithm description:
Step by step:
- Initialize a Q-value table over all states $S$ and actions $A$, with every Q-value set to 0.
- Then training starts.
- At each training step we read the current game state $S$.
- Action selection: look up which action has the largest Q-value in state $S$ and choose that action.
- Execute the action.
- Observe the reward $R$ and the next state $S'$.
- Update the entry of the previous state: $Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma \max_{a} Q(S',a) - Q(S,A)\big)$ (this rule is transcribed into code right after the remarks below).
- $\max_{a} Q(S',a)$ is the Q-value of the best action in the next state $S'$.
- Keep iterating.
Notes:
- $\alpha$ is the learning rate.
- $\gamma$ is the discount factor.
Remarks:
- The term $\gamma \max_{a} Q(S',a)$ feeds the consequences of executing an action back into earlier states.
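A minimal sketch of the update rule above, assuming the Q-table is stored as a plain dict mapping state keys to per-action value lists (as in the implementations below):

# Tabular Q-learning update, a direct transcription of the formula above.
# Q: dict mapping state key -> list of Q-values, one per action.
def q_update(Q, s, a, r, s_next, alpha=0.7, gamma=0.95):
    target = r + gamma * max(Q[s_next])             # R + gamma * max_a Q(S', a)
    Q[s][a] = Q[s][a] + alpha * (target - Q[s][a])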
(4) Q-Learning: a first implementation
As in the figure above, the horizontal and vertical distances $dx, dy$ from the bird to the next pipe are used as the state.
import json

class player1():
    def __init__(self):
        self.LR = 0.7            # learning rate alpha
        self.factor = 0.95       # discount factor gamma
        self.score_max = 0
        self.game_count = 0
        self.load_qvalues()
        self.history = []
        self.last_statu = None
        self.last_step = None
        self.score = 0

    def load_qvalues(self):
        """
        Load Q-values from a JSON file
        """
        self.qvalues = {}
        # try to open the file; if it is missing, build a zero-initialized table
        try:
            fil = open("qvalues1.json", "r")
            self.qvalues = json.load(fil)
            fil.close()
        except:
            # X -> [-1, 0 ... 288, 289]
            # Y -> [-530, -529 ... 528, 529]
            for x in range(-1, 290):
                for y in range(-530, 530):
                    self.qvalues[str(x) + "_" + str(y)] = [0, 0]
            self.dump_qvalues()

    def act(self, dx, dy, vY, over):
        dx = int(dx % 290)
        dy = int(dy)
        now_statu = str(dx) + "_" + str(dy)
        reward = 3
        if over == 1:
            reward = -1000
        if self.last_step is not None:
            self.update_qvalue(
                self.last_statu, self.last_step, now_statu, reward)
        if over == 1:
            if self.score > self.score_max:
                self.score_max = self.score
            print("game-", self.game_count, "-score-",
                  self.score, "-max score-", self.score_max)
            self.score = 0
            self.last_step = None
            self.game_count += 1
            if self.game_count % 40 == 0:
                self.dump_qvalues()
            return None
        self.score += 1
        self.last_statu = now_statu
        if self.qvalues[now_statu][0] >= self.qvalues[now_statu][1]:
            self.last_step = 0
            return 0
        else:
            self.last_step = 1
            return 1

    def dump_qvalues(self):
        """
        Dump the Q-values to the JSON file
        """
        fil = open("qvalues1.json", "w")
        json.dump(self.qvalues, fil)
        fil.close()
        print("Q-values updated on local file of train loops",
              int(self.game_count / 40))

    def update_qvalue(self, now_statu, now_step, next_statu, reward):
        # Q(S,A) = (1-alpha)*Q(S,A) + alpha*(R + gamma*max_a Q(S',a))
        self.qvalues[now_statu][now_step] = self.qvalues[now_statu][now_step] * \
            (1 - self.LR) + self.LR * \
            (reward + self.factor * max(self.qvalues[next_statu]))
Drawbacks:
- the state space is far too large;
- because the state space is so large, each state gets updated only rarely, so convergence is slow;
- hitting the upper edge of the screen is not handled, which also slows convergence;
- the negative reward at death propagates backwards very slowly, again slowing convergence.
(5) Q-Learning: an optimized implementation
class player2():
    def __init__(self):
        self.LR = 0.7            # learning rate alpha
        self.factor = 0.95       # discount factor gamma
        self.score_max = 0
        self.game_count = 0
        self.load_qvalues()
        self.history = []
        self.last_statu = None
        self.last_step = None
        self.score = 0

    def load_qvalues(self):
        """
        Load Q-values from a JSON file
        """
        self.qvalues = {}
        # try to open the file; if it is missing, build a zero-initialized table
        try:
            fil = open("qvalues.json", "r")
            self.qvalues = json.load(fil)
            fil.close()
        except:
            # Optimization 1: shrink the state space and add the vertical velocity
            # X -> [-1, 0 ... 28, 29]
            # Y -> [-53, -52 ... 51, 52]
            # V -> [-11, -10 ... 9, 10]
            for x in range(-1, 30):
                for y in range(-53, 53):
                    for v in range(-11, 11):
                        self.qvalues[str(x) + "_" + str(y) +
                                     "_" + str(v)] = [0, 0]
            self.dump_qvalues()

    def act(self, dx, dy, vY, over):
        if over == 1:
            if self.score > self.score_max:
                self.score_max = self.score
            print("game-", self.game_count, "-score-",
                  self.score, "-max score-", self.score_max)
            self.score = 0
            self.last_step = None
            self.game_count += 1
            self.update_qvalue()
            return None
        dx = int((dx % 290) / 10)
        dy = int(dy / 10)
        now_statu = str(dx) + "_" + str(dy) + "_" + str(vY)
        if self.last_step is not None:
            self.history.append((self.last_statu, now_statu, self.last_step))
        self.score += 1
        self.last_statu = now_statu
        if self.qvalues[now_statu][0] >= self.qvalues[now_statu][1]:
            self.last_step = 0
            return 0
        else:
            self.last_step = 1
            return 1

    def dump_qvalues(self):
        """
        Dump the Q-values to the JSON file
        """
        fil = open("qvalues.json", "w")
        json.dump(self.qvalues, fil)
        fil.close()
        print("Q-values updated on local file of train loops",
              int(self.game_count / 40))

    def update_qvalue(self):
        # Optimization 2: update once per episode, from the last step backwards,
        # so that the negative reward at death propagates quickly
        self.history = list(reversed(self.history))
        # Optimization 3: detect whether the bird died against the upper edge
        high_death_flag = True if int(
            self.history[0][1].split("_")[1]) > 5 else False
        if high_death_flag:
            print("high death")
        count = 0
        for i in self.history:
            now_statu = i[0]
            next_statu = i[1]
            now_step = i[2]
            reward = 1.0
            if count < 2:
                reward = -1000
            # Optimization 3: after an upper-edge death, give the most recent
            # upward action (flap) a negative reward
            elif high_death_flag and now_step == 1:
                reward = -1000
                high_death_flag = False
            count += 1
            self.qvalues[now_statu][now_step] = self.qvalues[now_statu][now_step] * \
                (1 - self.LR) + self.LR * \
                (reward + self.factor * max(self.qvalues[next_statu]))
        if self.game_count % 40 == 0:
            self.dump_qvalues()
        self.history = []
Optimizations:
- Optimization 1: shrink the state space and add the vertical velocity to the state.
- Optimization 2: update only at the end of each episode, walking backwards through the history, so that the negative reward at death propagates quickly.
- Optimization 3: detect whether the bird crashed into the upper edge and, if so, give the most recent upward action a negative reward.
(6) Deep Q-Learning
DQN is a form of DRL (deep reinforcement learning); it combines deep learning with Q-learning. The previous sections showed the limitation of the S-A table: once the combinations of states and actions can no longer be enumerated, the optimal action can no longer be picked by table lookup, so a neural network is used to approximate the Q-values instead.
Loss function:
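Written out to match the target definition in the step list below, the loss DQN minimizes is the squared TD error between the (temporarily frozen) target network's bootstrapped estimate and the online network's estimate, averaged over transitions sampled from the replay memory $D$:

$L(\theta) = \Big(r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-) - Q(\phi_j, a_j; \theta)\Big)^2$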
Reinforcement learning procedure:
Step by step:
- Initialize a replay memory $D$ with capacity $N$.
- Initialize the state-value network $Q$ with random weights $\theta$.
- Initialize the target state-value network $\hat{Q}$ and set its weights $\theta^-$ equal to the weights $\theta$ of $Q$.
- Set an exploration probability $\epsilon$.
- Start the training loop.
- Obtain the initial state $s_1 = x_1$ and preprocess it: $\phi_1 = \phi(s_1)$.
- Start the game loop.
- With probability $\epsilon$ choose a random action $a_t$,
- otherwise choose $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$.
- Execute $a_t$, obtaining the reward $r_t$ and the new observation $x_{t+1}$.
- Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$.
- Store the transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in the replay memory.
- Set the training target to $y_j = r_j$ if the game terminates at $\phi_{j+1}$,
- or to $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$ if it does not.
- The loss is $(y_j - Q(\phi_j, a_j; \theta))^2$; sample a random minibatch from the replay memory, train the state-value network $Q$ on it and update the weights $\theta$ (this step is sketched in code right after this list).
- Every $C$ steps set $\hat{Q} = Q$.
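A minimal sketch of the target construction for one sampled minibatch (plain NumPy; `q_net` and `target_net` are assumed to be Keras-style models with a `predict` method, and the arrays are those sampled from $D$ -- the full implementation in the next section uses a Dueling/Double-DQN variant of the same idea):

import numpy as np

# Build the vanilla-DQN regression targets y_j for one minibatch (sketch).
def dqn_targets(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q_values = q_net.predict(states)              # Q(phi_j, . ; theta)
    next_q = target_net.predict(next_states)      # Q^(phi_{j+1}, . ; theta^-)
    targets = np.copy(q_values)
    idx = np.arange(len(actions))
    # y_j = r_j if the episode ended, else r_j + gamma * max_a' Q^(phi_{j+1}, a'; theta^-)
    targets[idx, actions] = rewards + gamma * np.max(next_q, axis=1) * (1 - dones)
    return targets                                # then: q_net.train_on_batch(states, targets)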
Notes:
- $\alpha$ is the learning rate.
- $\gamma$ is the discount factor.
- The probability $\epsilon$ starts out large and is gradually decreased down to a minimum threshold (a minimal decay schedule is sketched below).
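A minimal sketch of such an $\epsilon$-greedy schedule (illustrative constants; `q_net` is again assumed to be a Keras-style model, and the agent class in the next section implements the same idea with its own parameters):

import numpy as np

# epsilon-greedy action selection with a simple linear decay (sketch).
def select_action(q_net, state, epsilon, n_actions=2):
    if np.random.rand() < epsilon:                 # explore
        return np.random.randint(n_actions)
    q = q_net.predict(np.array([state]))           # exploit: argmax_a Q(s, a)
    return int(np.argmax(q))

def decay_epsilon(epsilon, decay=0.001, min_epsilon=0.001):
    return max(min_epsilon, epsilon - decay)       # anneal toward the floor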
Remarks:
- The probability $\epsilon$ drives random exploration: a freshly initialized agent needs a lot of random exploration, and as it matures the amount of exploration shrinks.
- Transitions are sampled at random from the replay memory $D$ to reduce the temporal correlation of the training data, so the network does not overfit to it.
- Two networks $Q$ and $\hat{Q}$ are used to keep the optimization target stable. Imagine there were only one network: it would be updated continuously, so the target it chases would keep moving; whenever $\theta$ changes, not only $Q(s, a)$ changes but $\max Q(s', a')$ changes as well. With a separate target network, the "target" part of the formula above stays temporarily fixed, and the repeated updates of $\theta$ chase a fixed goal instead of a constantly shifting one.
(7) Deep Q-Learning: implementation
import io_game
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
print("import ok ...")

# global parameters
ACTIONS = 2        # number of actions: do nothing / flap
INPUTS = 6         # width of the first dense layer
STATE_SHAPE = 6    # size of the input state vector

# init thread and import game io
threadsPool = []
game_t = io_game.bird_game(1)
threadsPool.append(game_t)
# Dueling network: a shared trunk with separate value (v) and advantage (a) heads
class DDDQN(tf.keras.Model):
    def __init__(self):
        super(DDDQN, self).__init__()
        self.d1 = tf.keras.layers.Dense(INPUTS, activation='relu')
        self.d2 = tf.keras.layers.Dense(128, activation='relu')
        self.v = tf.keras.layers.Dense(1, activation=None)
        self.a = tf.keras.layers.Dense(ACTIONS, activation=None)

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        v = self.v(x)
        a = self.a(x)
        # dueling aggregation: Q = V + (A - mean(A))
        Q = v + (a - tf.math.reduce_mean(a, axis=1, keepdims=True))
        return Q

    def advantage(self, state):
        # advantage head only, used for action selection
        x = self.d1(state)
        x = self.d2(x)
        a = self.a(x)
        return a
# experience replay buffer D
class exp_replay():
    def __init__(self, buffer_size=100000):
        self.buffer_size = buffer_size
        self.state_mem = np.zeros(
            (self.buffer_size, STATE_SHAPE), dtype=np.float32)
        self.action_mem = np.zeros((self.buffer_size), dtype=np.int32)
        self.reward_mem = np.zeros((self.buffer_size), dtype=np.float32)
        self.next_state_mem = np.zeros(
            (self.buffer_size, STATE_SHAPE), dtype=np.float32)
        self.done_mem = np.zeros((self.buffer_size), dtype=bool)
        self.pointer = 0

    def add_exp(self, state, action, reward, next_state, done):
        idx = self.pointer % self.buffer_size
        self.state_mem[idx] = state
        self.action_mem[idx] = action
        self.reward_mem[idx] = reward
        self.next_state_mem[idx] = next_state
        self.done_mem[idx] = 1 - int(done)   # stores the "not done" mask
        self.pointer += 1

    def sample_exp(self, batch_size=64):
        max_mem = min(self.pointer, self.buffer_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        states = self.state_mem[batch]
        actions = self.action_mem[batch]
        rewards = self.reward_mem[batch]
        next_states = self.next_state_mem[batch]
        dones = self.done_mem[batch]
        return states, actions, rewards, next_states, dones
class agent():
    def __init__(self, gamma=0.99, replace=100, lr=0.001):
        self.gamma = gamma
        self.auto_save = 40          # save weights every 40 training calls
        self.train_count = 0
        self.epsilon = 1.0
        self.min_epsilon = 0.001
        self.epsilon_decay = 0.1*3
        self.replace = replace       # sync the target network every `replace` steps
        self.trainstep = 0
        self.memory = exp_replay()
        self.batch_size = 32
        self.q_net = DDDQN()
        self.target_net = DDDQN()
        try:
            self.q_net.load_weights("game_weights")
            self.target_net.load_weights("game_weights")
            print("weight loaded ...")
        except:
            pass
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
        self.q_net.compile(loss='mse', optimizer=opt)
        self.target_net.compile(loss='mse', optimizer=opt)

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            # biased random exploration: roughly one flap per ten "do nothing" picks
            return np.random.choice([int(i/20) for i in range(ACTIONS+20)])
        else:
            actions = self.q_net.advantage(np.array([state]))
            action = np.argmax(actions)
            return action

    def update_mem(self, state, action, reward, next_state, done):
        self.memory.add_exp(state, action, reward, next_state, done)

    def update_target(self):
        self.target_net.set_weights(self.q_net.get_weights())

    def update_epsilon(self):
        self.epsilon = self.epsilon - \
            self.epsilon_decay if self.epsilon > self.min_epsilon else self.min_epsilon
        return self.epsilon

    def train(self):
        if self.memory.pointer < self.batch_size:
            return
        self.train_count += 1
        # Q^ = Q every `replace` training steps
        if self.trainstep % self.replace == 0:
            self.update_target()
        states, actions, rewards, next_states, dones = self.memory.sample_exp(
            self.batch_size)
        target = self.q_net.predict(states)                    # [q_a, q_b]
        q_target = np.copy(target)
        next_state_val = self.target_net.predict(next_states)  # [q^_a, q^_b]
        max_action = np.argmax(self.q_net.predict(
            next_states), axis=1)       # Double DQN: online net picks the next action
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        q_target[batch_index, actions] = rewards + self.gamma * \
            next_state_val[batch_index, max_action]*dones  # expected Q value (dones is the not-done mask)
        self.q_net.train_on_batch(states, q_target)
        self.update_epsilon()
        self.trainstep += 1
        if self.train_count % self.auto_save == 0:
            self.dump_model()
            print("model saved...")

    def dump_model(self):
        self.q_net.save_weights("game_weights")
        self.target_net.save_weights("game_weights")
if __name__ == "__main__":
    for thread in threadsPool:
        thread.start()
    steps = 400000
    myplayer = agent()
    game_t.startGame()
    for s in range(steps):
        done = False
        # get init state
        dx, dy, vY, game_over, dx2, dy2, _, y = game_t.get_statu()
        state = [dx, dy, vY, dx2, dy2, y]
        total_reward = 0
        while not done:
            update = game_t.data_update()
            if update:
                action = myplayer.act(state)
                game_t.give_control(action)
                dx, dy, vY, game_over, dx2, dy2, _, y = game_t.get_statu()
                next_state = [dx, dy, vY, dx2, dy2, y]
                done = game_over
                reward = 1                # survive one frame: +1
                if done == 1:
                    reward = -1000        # crash: large negative reward
                myplayer.update_mem(state, action, reward, next_state, done)
                state = next_state
                total_reward += 1
                if done:
                    print("total reward after {} episode is {} and epsilon is {}".format(
                        s, total_reward, myplayer.epsilon))
                    myplayer.train()
                    game_t.startGame()
                    break
(8) References
[1] yinyoupoet. 深度强化学习之深度Q网络DQN详解 [EB/OL]. yinyoupoet.github.io/2020/02/18/深度强化学习之深度Q网络DQN详解/#强化学习, 2020-02-18.
[2] Abhishek Suran. Dueling Double Deep Q Learning using Tensorflow 2.x [EB/OL]. https://towardsdatascience.com/dueling-double-deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a, 2020-07-10.
[3] Wikipedia. 强化学习 [EB/OL]. zh.wikipedia.org/wiki/强化学习, 2021-03-23.
[4] 野风. 强化学习——从Q-Learning到DQN到底发生了什么? [EB/OL]. https://zhuanlan.zhihu.com/p/35882937, 2018-04-19.
[5] 朱松纯. 浅谈人工智能:现状、任务、构架与统一 | 正本清源 [EB/OL]. https://mp.weixin.qq.com/s/3sKfJnPayDCCosKVP3Jz8Q?, 2018-07-27.