Reinforcement Learning Series: Q-learning (Through the Lens of Honor of Kings)

Article link: https://blog.csdn.net/zaowuyingshu/article/details/109999441

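This article introduces the classic Q-learning reinforcement learning algorithm and explains its gradient-descent-like update rule, in which the update of the action-value function Q does not depend on how actions are chosen. Using a maze problem as the example, it shows how an agent can learn to find the exit on its own, and it explains reinforcement learning by analogy with backdooring the enemy base in Honor of Kings. A Python implementation is given at the end.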


Contents

Preface
1. An update rule that resembles gradient descent
2. Notation
3. Implementing Q-learning on a maze problem
4. Python implementation

Preface

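Note: an earlier post covered DQN, the method seen most often in reinforcement learning. This post covers the classic Q-learning algorithm, a value-based reinforcement learning algorithm that learns by iteratively updating its value estimates.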

1. An update rule that resembles gradient descent

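Update logic:

Initialize Q(s,a) (the hero spawns)
Repeat (for each episode): (the game ends only when the backdoor succeeds)
     Initialize s (the hero respawns)
     Repeat (for each step of episode): (keep dying and keep learning how to backdoor)
          Choose a from s using a policy derived from Q (e.g., ε-greedy)
          Take action a, observe reward r and next state s'
          $Q\left( s,a \right) \gets Q\left( s,a \right) +\alpha \left[ r +\gamma \max _{a'}Q\left( s',a' \right) -Q\left( s,a \right) \right]$
          $s\gets s'$
     until s is terminal
Q-learning updates with the largest action value available in the next state s' (the state at time t+1), so the update of the action-value function Q does not depend on how the next action is actually chosen. In other words, it is an off-policy method.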
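One way to see the resemblance to gradient descent is the following sketch, which is not taken from the original post. Here r is the reward observed after taking action a. Assume the target $y$ is held fixed, define a squared error $L$ on the current estimate, and take one gradient step on $Q(s,a)$ with step size $\alpha$:

```latex
\begin{aligned}
y &= r + \gamma \max_{a'} Q(s',a') && \text{(target, treated as a constant)} \\
L &= \tfrac{1}{2}\bigl(y - Q(s,a)\bigr)^2 \\
\frac{\partial L}{\partial Q(s,a)} &= -\bigl(y - Q(s,a)\bigr) \\
Q(s,a) &\gets Q(s,a) - \alpha \frac{\partial L}{\partial Q(s,a)}
        = Q(s,a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr]
\end{aligned}
```

Because the target $y$ itself depends on Q, this is only an analogy (often described as a semi-gradient step), not true gradient descent on a fixed objective.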

2. Notation

Symbol	Meaning
$\alpha$	Learning rate
$a$	Action
$s$	State
$\gamma$	Discount factor
$r$	Reward received on reaching the new state
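As a worked example of the update rule, the sketch below applies a single update to a tiny table. The helper `q_update` and every number in it are made up for illustration; they are not part of the maze task.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    target = r + gamma * np.max(Q[s_next])  # best value available from s'
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table with 2 states and 2 actions (made-up values).
Q = np.array([[0.0, 0.5],
              [1.0, 2.0]])

td = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(td)       # TD error for this step
print(Q[0, 1])  # updated entry
```

With these numbers the target is 1 + 0.9 × 2 = 2.8, the TD error is 2.8 - 0.5 = 2.3, and the entry moves from 0.5 to 0.5 + 0.1 × 2.3 = 0.73. Only the maximum over the next state's row enters the target, so the result is the same whichever action is later taken in s'.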

3. Implementing Q-learning on a maze problem

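The maze for this article is shown in the figure. The goal is for the agent to begin at start and learn on its own how to reach the exit, end. Each cell is a state, and because the surroundings differ from cell to cell, the available actions differ too: at start, for example, the agent can only move down or right. One way to train is to give every action a reward of -1, and only the move that reaches end a reward of 0, which also ends the episode. (The Python code in Section 4 uses a different scheme with the same effect: 0 for every step and 1 on reaching end, combined with a discount factor below 1.) Either way, the agent eventually finds the fastest route to end by itself.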
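An Honor of Kings aside: for this model the problem is really very simple, because a human can see at a glance that the shortest path is 0-3-4-7-end. The computer cannot. It does not know the environment at all and has no idea what the whole map looks like, so it can only find the treasure (the exit) by walking one step at a time.
Most people have probably played Honor of Kings. The situation is similar: you have no vision of places you have not reached (you cannot see the whole map like a god). Start is your spawn point and END is the enemy crystal. Your goal is to slip past every enemy and backdoor their base (Meng Lei would approve!), staying out of all enemy vision along the way. That is exactly the task our agent has to accomplish. If you are watching a match or spectating (OB), you can easily tell Meng Lei to take the route 0-3-4-7-end and backdoor. But when you are in the game yourself, you know nothing about the rest of the environment. Your vision is limited to your own surroundings, just like the agent inside the computer. So the agent has to try again and again to learn which way to go to backdoor and turn the game around, and when it fails it respawns and starts over. That, roughly, is a simple explanation of reinforcement learning.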
[Figure: the 3×3 maze, from start (top left) to end (bottom right)]
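To make the reward design concrete, the sketch below computes the discounted return of a path under two schemes: -1 per step with 0 on arrival (the scheme described in the text), and 0 per step with 1 on arrival (the scheme used by the code in Section 4). The helper names and the choice of γ = 0.9 for both schemes are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def step_penalty_rewards(n_steps):
    """-1 for every step except the last one, which reaches end and earns 0."""
    return [-1.0] * (n_steps - 1) + [0.0]

def goal_bonus_rewards(n_steps):
    """0 for every step except the last one, which reaches end and earns 1."""
    return [0.0] * (n_steps - 1) + [1.0]

gamma = 0.9
for n_steps in (4, 6):  # 4 steps: 0-3-4-7-end; 6 steps: a detour such as 0-1-0-3-4-7-end
    print(n_steps,
          discounted_return(step_penalty_rewards(n_steps), gamma),
          discounted_return(goal_bonus_rewards(n_steps), gamma))
```

Under both schemes the 4-step route scores higher than the 6-step detour, which is why either design pushes the agent toward the shortest path.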

4. Python implementation

The code is as follows (example):

import numpy as np
import matplotlib.pyplot as plt

# Draw the maze
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()

plt.plot([1, 1], [0, 1], color='blue', linewidth=2)
plt.plot([1, 2], [2, 2], color='b', linewidth=2)
plt.plot([2, 2], [2, 1], color='b', linewidth=2)
plt.plot([2, 3], [1, 1], color='b', linewidth=2)

plt.text(0.5, 2.5, 'Our base', size=14, ha='center')
plt.text(1.5, 2.5, 'S1', size=14, ha='center')
plt.text(2.5, 2.5, 'S2', size=14, ha='center')
plt.text(0.5, 1.5, 'S3', size=14, ha='center')
plt.text(1.5, 1.5, 'S4', size=14, ha='center')
plt.text(2.5, 1.5, 'S5', size=14, ha='center')
plt.text(0.5, 0.5, 'S6', size=14, ha='center')
plt.text(1.5, 0.5, 'S7', size=14, ha='center')
plt.text(2.5, 0.5, 'enemy base', size=14, ha='center')

ax.set_xlim(0, 3)
ax.set_ylim(0, 3)
plt.tick_params(axis='both', which='both', bottom=False, top=False,
                labelbottom=False, right=False, left=False, labelleft=False)

line, = ax.plot([0.5], [2.5], marker="o", color='g', markersize=60)
line, = ax.plot([2.5], [0.5], marker="o", color='red', markersize=60)
plt.show()

# Set the initial theta_0

# Rows: states 0-7, one per maze cell
# Columns: the actions up, right, down, left; nan means the move is blocked
theta_0 = np.array([[np.nan, 1, 1, np.nan],  # s0
                    [np.nan, 1, np.nan, 1],  # s1
                    [np.nan, np.nan, 1, 1],  # s2
                    [1, 1, 1, np.nan],  # s3
                    [np.nan, np.nan, 1, 1],  # s4
                    [1, np.nan, np.nan, np.nan],  # s5
                    [1, np.nan, np.nan, np.nan],  # s6
                    [1, 1, np.nan, np.nan],  # s7 (s8 is the goal, so it has no row)
                    ])

def simple_convert_into_pi_from_theta(theta):
    '''Convert theta into action probabilities'''

    [m, n] = theta.shape
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])  # normalize each row to sum to 1

    pi = np.nan_to_num(pi)  # replace nan with 0

    return pi

# Compute the initial policy
pi_0 = simple_convert_into_pi_from_theta(theta_0)

# Set the initial Q function: small random values, nan where a move is blocked

[a, b] = theta_0.shape
Q = np.random.rand(a, b) * theta_0 * 0.1


def get_action(s, Q, epsilon, pi_0):
    '''ε-greedy: explore with probability epsilon, otherwise take the best-valued action'''
    direction = ["up", "right", "down", "left"]

    if np.random.rand() < epsilon:
        next_direction = np.random.choice(direction, p=pi_0[s, :])
    else:
        next_direction = direction[np.nanargmax(Q[s, :])]

    if next_direction == "up":
        action = 0
    elif next_direction == "right":
        action = 1
    elif next_direction == "down":
        action = 2
    elif next_direction == "left":
        action = 3

    return action


def get_s_next(s, a, Q, epsilon, pi_0):
    '''Return the state reached from s by action a (Q, epsilon and pi_0 are unused)'''
    direction = ["up", "right", "down", "left"]
    next_direction = direction[a]


    if next_direction == "up":
        s_next = s - 3
    elif next_direction == "right":
        s_next = s + 1
    elif next_direction == "down":
        s_next = s + 3
    elif next_direction == "left":
        s_next = s - 1

    return s_next

def Q_learning(s, a, r, s_next, Q, eta, gamma):
    '''One Q-learning update of Q[s, a]'''

    if s_next == 8:  # reached the goal: no future value to add
        Q[s, a] = Q[s, a] + eta * (r - Q[s, a])

    else:
        Q[s, a] = Q[s, a] + eta * (r + gamma * np.nanmax(Q[s_next,: ]) - Q[s, a])

    return Q

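def goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi):
    '''Run one episode from s0 to the goal; return the state-action history and the updated Q'''
    s = 0
    a = a_next = get_action(s, Q, epsilon, pi)
    s_a_history = [[0, np.nan]]

    while True:
        a = a_next  # update the action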

        s_a_history[-1][1] = a

        s_next = get_s_next(s, a, Q, epsilon, pi)

        s_a_history.append([s_next, np.nan])

        if s_next == 8:
            r = 1
            a_next = np.nan
        else:
            r = 0
            a_next = get_action(s_next, Q, epsilon, pi)



        Q = Q_learning(s, a, r, s_next, Q, eta, gamma)


        if s_next == 8:
            break
        else:
            s = s_next

    return [s_a_history, Q]

eta = 0.1  # learning rate
gamma = 0.9  # discount factor
epsilon = 0.5  # initial value for ε-greedy
v = np.nanmax(Q, axis=1)
is_continue = True
episode = 1

V = []  # state values after each episode
V.append(np.nanmax(Q, axis=1))  # max over actions for each state

while is_continue:
    print("Episode: " + str(episode))

    epsilon = epsilon / 2  # halve the exploration rate every episode

    # Update Q by running one episode
    [s_a_history, Q] = goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi_0)


    new_v = np.nanmax(Q, axis=1)
    print(np.sum(np.abs(new_v - v)))  # total change in state values
    v = new_v
    V.append(v)

    print("Steps to solve the maze: " + str(len(s_a_history) - 1))


    episode = episode + 1
    if episode > 100:
        break
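
Once training has finished, the learned route can be read off the table by always taking the highest-valued action. The sketch below is self-contained: `greedy_path` is a hypothetical helper, and the small `Q_demo` table holds made-up values (not output from the training loop above), arranged so that the greedy route is 0-3-4-7-8. It assumes the same action encoding and state numbering as the listing above.

```python
import numpy as np

# Action encoding as in the listing above: 0=up, 1=right, 2=down, 3=left,
# on a 3x3 grid whose states are numbered row by row.
STEP = {0: -3, 1: +1, 2: +3, 3: -1}

def greedy_path(Q, start=0, goal=8, max_steps=50):
    """Follow argmax_a Q[s, a] from start until goal is reached (or give up)."""
    path = [start]
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        a = int(np.nanargmax(Q[s, :]))  # nan marks a blocked move and is skipped
        s = s + STEP[a]
        path.append(s)
    return path

nan = np.nan
# Made-up Q values for illustration only (rows s0..s7; columns up, right, down, left).
Q_demo = np.array([
    [nan, 0.50, 0.73, nan],   # s0
    [nan, 0.40, nan, 0.60],   # s1
    [nan, nan, 0.30, 0.50],   # s2
    [0.60, 0.81, 0.50, nan],  # s3
    [nan, nan, 0.90, 0.70],   # s4
    [0.40, nan, nan, nan],    # s5
    [0.70, nan, nan, nan],    # s6
    [0.80, 1.00, nan, nan],   # s7
])

print(greedy_path(Q_demo))
```

With the real table from the training loop, the call would be `greedy_path(Q)`. The `max_steps` cap is a safeguard in case the table has not converged and the greedy policy walks in a loop.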