From Q-Learning to Deep Q-Learning (DQN)

(0) Preface

This article walks from Q-Learning to Deep Q-Learning in reinforcement learning, applies each version of the code to the Flappy Bird mini-game, and lists all references at the end.

(1) Reinforcement learning

[Figure: the agent–environment interaction loop]

Through continuous interaction with its environment, and stimulated by the rewards or punishments the environment hands out, an organism gradually forms expectations about those stimuli and develops habitual behaviors that obtain the greatest benefit.

In general, reinforcement learning involves:

  1. the set of environment states $\Omega_S$;
  2. the set of actions $\Omega_A$;
  3. the agent's policy $\pi$ over environment states;
  4. the rules (reward function) $\Omega_R$ that specify the "immediate reward" after a transition;
  5. the currently observed state $S$;
  6. the next state $S'$;
  7. the action $A$ taken in the current state;
  8. the reward $R$ obtained in the current state.

As the figure above shows, the agent observes the current state $S$, uses its policy $\pi$ to pick the action $A$ that currently yields the greatest benefit, then observes the resulting reward $R$ and the next state $S'$, and updates its policy $\pi$.

The environment is usually assumed to be a first-order Markov process: the next state depends only on the current state and the action taken in it.
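
A minimal sketch of this interaction loop, assuming a hypothetical env with reset/step methods and an agent with select_action/update methods (these names are illustrative, not from the game code later in this article):

def run_episode(env, agent):
    state = env.reset()                      # current state S
    done = False
    total_reward = 0.0
    while not done:
        action = agent.select_action(state)                    # action A chosen by the policy pi
        next_state, reward, done = env.step(action)            # environment returns R and S'
        agent.update(state, action, reward, next_state, done)  # improve the policy pi
        state = next_state                   # S <- S' (first-order Markov assumption)
        total_reward += reward
    return total_reward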

An ideal reinforcement learning algorithm should:

  1. let the agent converge within a finite number of steps;
  2. converge to the global optimum;
  3. run fast, otherwise the interaction with the environment becomes painfully slow;
  4. use a bounded amount of memory;
  5. converge as quickly as possible.

Common reinforcement learning algorithms include:

  • Monte-Carlo Learning
  • Q-Learning

(2) The Flappy Bird game

[Figure: the Flappy Bird game]
This is a classic game, and the algorithm descriptions in this article are based on it.

(3) Q-Learning

The method makes decisions by comparing the Q values of the different actions $A$ available in the same state $S$.

Algorithm description:
[Figure: Q-Learning algorithm pseudocode]
In words:

  • For all states $S$ and actions $A$, initialize a Q table with every Q value set to 0.
  • Then training begins:
    • At each training step, observe the current game state $S$.
    • Choose an action: look up which action has the largest Q value in state $S$ and pick that action.
    • Execute the action.
    • Observe the reward $R$ and the next state $S'$.
    • Update the previous state's entry: $Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma \max_a Q(S',a) - Q(S,A)\big)$ (see the sketch after the notes below).
    • $\max_a Q(S',a)$ is the Q value of the action with the largest Q value in the next state $S'$.
    • Keep iterating.

Notes:

  • $\alpha$ is the learning rate.
  • $\gamma$ is the discount factor.

Thoughts:

  • The term $\gamma \max_a Q(S',a)$ feeds the reward consequences of the executed action back into the current estimate.
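
As a concrete reference, here is a minimal sketch of the tabular update above, assuming a Q table stored as a dict that maps a state key to a list with one Q value per action (the same layout the Flappy Bird players below use); ALPHA and GAMMA are the learning rate and discount factor:

ALPHA, GAMMA = 0.7, 0.95  # learning rate and discount factor

def choose_action(qvalues, state):
    # greedy choice: the action whose Q value is largest in this state
    q = qvalues[state]
    return max(range(len(q)), key=lambda a: q[a])

def q_update(qvalues, state, action, reward, next_state):
    # Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
    target = reward + GAMMA * max(qvalues[next_state])
    qvalues[state][action] += ALPHA * (target - qvalues[state][action])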

(4) Q-Learning: a first implementation

[Figure: the bird's horizontal and vertical offsets $dx, dy$ to the next pipe]
A natural state is the bird's offset $dx, dy$ to the next pipe, as shown in the figure above.

import json


class player1():
    def __init__(self):
        self.LR = 0.7
        self.factor = 0.95
        self.score_max = 0
        self.game_count = 0
        self.load_qvalues()
        self.history = []
        self.last_statu = None
        self.last_step = None
        self.score = 0

    def load_qvalues(self):
        """
        Load q values from a JSON file
        """
        self.qvalues = {}
        # try open file
        try:
            fil = open("qvalues1.json", "r")
            self.qvalues = json.load(fil)
            fil.close()
        except:
            # X -> [-1, 0 ... 289]
            # Y -> [-530, -529 ... 529]
            for x in range(-1, 290):
                for y in range(-530, 530):
                    self.qvalues[str(x) + "_" + str(y)] = [0, 0]
            self.dump_qvalues()

    def act(self, dx, dy, vY, over):
        dx = int(dx % 290)
        dy = int(dy)
        now_statu = str(dx)+"_"+str(dy)
        reward = 3
        if over == 1:
            reward = -1000
        if self.last_step is not None:
            self.update_qvalue(
                self.last_statu, self.last_step, now_statu, reward)
        if over == 1:
            if self.score > self.score_max:
                self.score_max = self.score
            print("game-", self.game_count, "-score-",
                  self.score, "-max score-", self.score_max)
            self.score = 0
            self.last_step = None
            self.game_count += 1
            if self.game_count % 40 == 0:
                self.dump_qvalues()
            return None
        self.score += 1
        self.last_statu = now_statu
        if self.qvalues[now_statu][0] >= self.qvalues[now_statu][1]:
            self.last_step = 0
            return 0
        else:
            self.last_step = 1
            return 1

    def dump_qvalues(self):
        """
        Dump the qvalues to the JSON file
        """
        fil = open("qvalues1.json", "w")
        json.dump(self.qvalues, fil)
        fil.close()
        print("Q-values updated on local file of train loops",
              int(self.game_count/40))

    def update_qvalue(self, now_statu, now_step, next_statu, reward):
        self.qvalues[now_statu][now_step] = self.qvalues[now_statu][now_step] * \
            (1-self.LR)+self.LR * \
            (reward+self.factor*max(self.qvalues[next_statu]))

Drawbacks:

  • The state space is too large.
  • Because the state space is so large, each state gets updated only rarely, so convergence is slow.
  • Crashing into the upper pipe is not handled specially, which slows convergence.
  • The negative reward at the end of each game propagates backwards slowly, which also slows convergence.

(5) Q-Learning: an optimized implementation


import json


class player2():
    def __init__(self):
        self.LR = 0.7
        self.factor = 0.95
        self.score_max = 0
        self.game_count = 0
        self.load_qvalues()
        self.history = []
        self.last_statu = None
        self.last_step = None
        self.score = 0

    def load_qvalues(self):
        """
        Load q values from a JSON file
        """
        self.qvalues = {}
        # try open file
        try:
            fil = open("qvalues.json", "r")
            self.qvalues = json.load(fil)
            fil.close()
        except:
            # Optimization 1: shrink the state space and add the vertical velocity to the state
            # X -> [-1, 0 ... 29]
            # Y -> [-53, -52 ... 52]
            for x in range(-1, 30):
                for y in range(-53, 53):
                    for v in range(-11, 11):
                        self.qvalues[str(x) + "_" + str(y) +
                                     "_" + str(v)] = [0, 0]
            self.dump_qvalues()

    def act(self, dx, dy, vY, over):
        if over == 1:
            if self.score > self.score_max:
                self.score_max = self.score
            print("game-", self.game_count, "-score-",
                  self.score, "-max score-", self.score_max)
            self.score = 0
            self.last_step = None
            self.game_count += 1
            self.update_qvalue()
            return None
        dx = int((dx % 290)/10)
        dy = int(dy/10)
        now_statu = str(dx)+"_"+str(dy)+"_"+str(vY)
        if self.last_step is not None:
            self.history.append((self.last_statu, now_statu, self.last_step))
        self.score += 1
        self.last_statu = now_statu
        if self.qvalues[now_statu][0] >= self.qvalues[now_statu][1]:
            self.last_step = 0
            return 0
        else:
            self.last_step = 1
            return 1

    def dump_qvalues(self):
        """
        Dump the qvalues to the JSON file
        """
        fil = open("qvalues.json", "w")
        json.dump(self.qvalues, fil)
        fil.close()
        print("Q-values updated on local file of train loops",
              int(self.game_count/40))

    def update_qvalue(self):
        # Optimization 2: update at the end of each episode, walking the history backwards so the negative reward propagates
        self.history = list(reversed(self.history))
        # Optimization 3: check whether the bird died by flying too high (hitting the upper pipe)
        high_death_flag = int(self.history[0][1].split("_")[1]) > 5
        if high_death_flag:
            print("high death")
        count = 0
        for i in self.history:
            now_statu = i[0]
            next_statu = i[1]
            now_step = i[2]
            reward = 1.0
            if count < 2:
                reward = -1000
            # Optimization 3: after a high death, also give the most recent flap (action 1) a negative reward
            elif high_death_flag and now_step == 1:
                reward = -1000
                high_death_flag = False
            count += 1
            self.qvalues[now_statu][now_step] = self.qvalues[now_statu][now_step] * \
                (1-self.LR)+self.LR * \
                (reward+self.factor*max(self.qvalues[next_statu]))
        if self.game_count % 40 == 0:
            self.dump_qvalues()
        self.history = []

Optimizations:

  • Optimization 1: shrink the state space and add the vertical velocity to the state.
  • Optimization 2: update at the end of each episode, walking the history backwards so the negative reward propagates.
  • Optimization 3: detect crashes into the upper pipe and give the most recent flap a negative reward.

(6) Deep Q-Learning

DQN is a form of deep reinforcement learning (DRL): it combines deep learning with Q-Learning. The previous sections showed the limits of the S-A table: when the combinations of states and actions cannot be enumerated, the optimal action can no longer be selected by table lookup, so a neural network is used to approximate the Q function instead.
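
As a minimal sketch of this idea (a plain feed-forward network, not the author's dueling network shown later), a small Keras model can map the state vector directly to one Q value per action; the layer sizes here are illustrative:

import numpy as np
import tensorflow as tf

ACTIONS = 2      # two possible actions, as in the implementation below
STATE_SHAPE = 6  # same state layout as the implementation below

# a plain Q network: state in, one Q value per action out
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(STATE_SHAPE,)),
    tf.keras.layers.Dense(ACTIONS, activation=None),
])

state = np.zeros((1, STATE_SHAPE), dtype=np.float32)  # dummy state vector
q_values = q_net(state).numpy()                       # shape (1, ACTIONS)
action = int(np.argmax(q_values[0]))                  # greedy action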

Loss function:
$L(\theta) = \big(y_j - Q(\phi_j, a_j; \theta)\big)^2$, where the target $y_j$ is defined in the training procedure below.

Training procedure:
[Figure: DQN algorithm pseudocode]

In words:

  • Initialize a replay memory $D$ with capacity $N$.
  • Initialize the action-value network $Q$ with random weights $\theta$.
  • Initialize the target action-value network $\hat{Q}$ and set its weights $\theta^- = \theta$.
  • Set an exploration probability $\epsilon$.
  • Start the training loop:
    • Get the initial state $s_1 = x_1$ and preprocess it: $\phi_1 = \phi(s_1)$.
    • Start the game loop:
      • With probability $\epsilon$ choose a random action $a_t$,
      • otherwise choose $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$.
      • Execute $a_t$, obtaining the reward $r_t$ and the new observation $x_{t+1}$.
      • Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess it: $\phi_{t+1} = \phi(s_{t+1})$.
      • Store the transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in the replay memory.
      • Set the training label $y_j = r_j$ if the game terminates at $\phi_{j+1}$,
      • or $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$ if it does not.
      • Sample a random minibatch from the replay memory and train the action-value network $Q$ with the loss $(y_j - Q(\phi_j, a_j; \theta))^2$, updating the weights $\theta$.
      • Every $C$ steps set $\hat{Q} = Q$.

Notes:

  • $\alpha$ is the learning rate.
  • $\gamma$ is the discount factor.
  • The probability $\epsilon$ starts out large and is gradually reduced towards a minimum threshold (a small decay sketch follows this list).
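
A minimal sketch of such a schedule, assuming a simple linear decay similar to the update_epsilon method in the implementation below (the numbers are illustrative):

EPSILON_START, EPSILON_MIN, EPSILON_DECAY = 1.0, 0.001, 0.01  # illustrative values

def decay_epsilon(epsilon):
    # shrink epsilon a little every training step, but never below the minimum threshold
    return max(EPSILON_MIN, epsilon - EPSILON_DECAY)

# with probability epsilon the agent explores (random action); otherwise it exploits (argmax of Q)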

Thoughts:

  1. The probability $\epsilon$ captures random exploration: at the beginning the agent needs a lot of random exploration, and as it matures the exploration shrinks.
  2. Transitions are sampled at random from the replay memory $D$ to reduce the temporal correlation of the data, so the network does not overfit to recent experience.
  3. Two networks $Q$ and $\hat{Q}$ are used to keep training stable. With a single, constantly updated network, the target it chases keeps changing: when $\theta$ changes, not only $Q(s,a)$ changes but $\max Q(s',a')$ changes as well. Keeping the target part of the formula temporarily fixed means the continually updated $\theta$ chases a fixed target rather than a moving one (see the sketch below).
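
Below is a minimal sketch of the target computation described above, assuming Keras-style networks with a predict method (as in the implementation that follows). Note that the implementation's train method actually uses the Double-DQN variant: the online network picks the next action and the target network evaluates it.

import numpy as np

GAMMA = 0.99  # discount factor

def dqn_targets(q_net, target_net, states, actions, rewards, next_states, not_done):
    # y_j = r_j                                                 if the episode ended at phi_{j+1}
    # y_j = r_j + gamma * max_a' Q_hat(phi_{j+1}, a'; theta^-)  otherwise
    q_target = q_net.predict(states)          # start from the current predictions
    next_q = target_net.predict(next_states)  # evaluated by the (frozen) target network
    idx = np.arange(len(states))
    q_target[idx, actions] = rewards + GAMMA * next_q.max(axis=1) * not_done
    return q_target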

(7) Deep Q-Learning implementation

import io_game
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
print("import ok ...")
# global parameters
ACTIONS = 2
INPUTS = 6
STATE_SHAPE = 6
# init thread and import game io
threadsPool = []
game_t = io_game.bird_game(1)
threadsPool.append(game_t)
# class of Neural network


class DDDQN(tf.keras.Model):
    def __init__(self):
        super(DDDQN, self).__init__()
        self.d1 = tf.keras.layers.Dense(INPUTS, activation='relu')
        self.d2 = tf.keras.layers.Dense(128, activation='relu')
        self.v = tf.keras.layers.Dense(1, activation=None)
        self.a = tf.keras.layers.Dense(ACTIONS, activation=None)

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        v = self.v(x)
        a = self.a(x)
        # dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
        Q = v + (a - tf.math.reduce_mean(a, axis=1, keepdims=True))
        return Q

    def advantage(self, state):
        x = self.d1(state)
        x = self.d2(x)
        a = self.a(x)
        return a


class exp_replay():
    def __init__(self, buffer_size=100000):
        self.buffer_size = buffer_size
        self.state_mem = np.zeros(
            (self.buffer_size, STATE_SHAPE), dtype=np.float32)
        self.action_mem = np.zeros((self.buffer_size), dtype=np.int32)
        self.reward_mem = np.zeros((self.buffer_size), dtype=np.float32)
        self.next_state_mem = np.zeros(
            (self.buffer_size, STATE_SHAPE), dtype=np.float32)
        self.done_mem = np.zeros((self.buffer_size), dtype=bool)
        self.pointer = 0

    def add_exp(self, state, action, reward, next_state, done):
        idx = self.pointer % self.buffer_size
        self.state_mem[idx] = state
        self.action_mem[idx] = action
        self.reward_mem[idx] = reward
        self.next_state_mem[idx] = next_state
        self.done_mem[idx] = 1 - int(done)  # stores "not done" so targets can be multiplied by it directly
        self.pointer += 1

    def sample_exp(self, batch_size=64):
        max_mem = min(self.pointer, self.buffer_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        states = self.state_mem[batch]
        actions = self.action_mem[batch]
        rewards = self.reward_mem[batch]
        next_states = self.next_state_mem[batch]
        dones = self.done_mem[batch]
        return states, actions, rewards, next_states, dones


class agent():
    def __init__(self, gamma=0.99, replace=100, lr=0.001):
        self.gamma = gamma
        self.auto_save = 40
        self.train_count = 0
        self.epsilon = 1.0
        self.min_epsilon = 0.001
        self.epsilon_decay = 0.1*3
        self.replace = replace
        self.trainstep = 0
        self.memory = exp_replay()
        self.batch_size = 32
        self.q_net = DDDQN()
        self.target_net = DDDQN()
        try:
            self.q_net.load_weights("game_weights")
            self.target_net.load_weights("game_weights")
            print("weight loaded ...")
        except:
            pass
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
        self.q_net.compile(loss='mse', optimizer=opt)
        self.target_net.compile(loss='mse', optimizer=opt)

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            # biased random exploration: action 1 is drawn only ~2 times in (ACTIONS+20)
            return np.random.choice([int(i/20) for i in range(ACTIONS+20)])
        else:
            actions = self.q_net.advantage(np.array([state]))
            action = np.argmax(actions)
            return action

    def update_mem(self, state, action, reward, next_state, done):
        self.memory.add_exp(state, action, reward, next_state, done)

    def update_target(self):
        self.target_net.set_weights(self.q_net.get_weights())

    def update_epsilon(self):
        self.epsilon = self.epsilon - \
            self.epsilon_decay if self.epsilon > self.min_epsilon else self.min_epsilon
        return self.epsilon

    def train(self):
        if self.memory.pointer < self.batch_size:
            return
        self.train_count += 1
        # Q^=Q
        if self.trainstep % self.replace == 0:
            self.update_target()
        states, actions, rewards, next_states, dones = self.memory.sample_exp(
            self.batch_size)

        target = self.q_net.predict(states)  # current predictions [q_a, q_b]
        q_target = np.copy(target)
        next_state_val = self.target_net.predict(next_states)  # target-network values [q^_a, q^_b]
        max_action = np.argmax(self.q_net.predict(
            next_states), axis=1)  # Double DQN: the online network picks the next action
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        q_target[batch_index, actions] = rewards + self.gamma * \
            next_state_val[batch_index, max_action]*dones  # dones holds 1-done, so terminal targets reduce to the reward
        self.q_net.train_on_batch(states, q_target)

        self.update_epsilon()
        self.trainstep += 1
        if self.train_count % self.auto_save == 0:
            self.dump_model()
            print("model saved...")

    def dump_model(self):
        self.q_net.save_weights("game_weights")
        self.target_net.save_weights("game_weights")


if __name__ == "__main__":
    for thread in threadsPool:
        thread.start()
    steps = 400000
    myplayer = agent()
    game_t.startGame()
    for s in range(steps):
        done = False
        # get init state
        dx, dy, vY, game_over, dx2, dy2, _, y = game_t.get_statu()
        state = [dx, dy, vY, dx2, dy2, y]
        total_reward = 0
        while not done:
            update = game_t.data_update()
            if update:
                action = myplayer.act(state)
                game_t.give_control(action)
                dx, dy, vY, game_over, dx2, dy2, _, y = game_t.get_statu()
                next_state = [dx, dy, vY, dx2, dy2, y]
                done = game_over
                reward = 1
                if done == 1:
                    reward = -1000
                myplayer.update_mem(state, action, reward, next_state, done)
                state = next_state
                total_reward += 1
                if done:
                    print("total reward after {} episode is {} and epsilon is {}".format(
                        s, total_reward, myplayer.epsilon))
                    myplayer.train()
                    game_t.startGame()
                    break

(8) References

[1] yinyoupoet. 深度强化学习之深度Q网络DQN详解 [EB/OL]. yinyoupoet.github.io/2020/02/18/深度强化学习之深度Q网络DQN详解/#强化学习, 2020-02-18.

[2] Abhishek Suran. Dueling Double Deep Q Learning using Tensorflow 2.x [EB/OL]. https://towardsdatascience.com/dueling-double-deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a, 2020-07-10.

[3] Wikipedia. 强化学习 [EB/OL]. zh.wikipedia.org/wiki/强化学习, 2021-03-23.

[4] 野风. 强化学习——从Q-Learning到DQN到底发生了什么? [EB/OL]. https://zhuanlan.zhihu.com/p/35882937, 2018-04-19.

[5] 朱松纯. 浅谈人工智能:现状、任务、构架与统一 | 正本清源 [EB/OL]. https://mp.weixin.qq.com/s/3sKfJnPayDCCosKVP3Jz8Q?, 2018-07-27.
