Introduction to Reinforcement Learning, Part 4


Welcome to part 4 of the reinforcement learning series, and to the Q-learning portion of it. In this part, we're going to wrap up basic Q-learning by making our own learning environment. I didn't originally plan to make this a tutorial; it was just something I wanted to do for myself, but after numerous requests, it only made sense to turn it into one!
If you've been following my tutorials over the years, you'll know I like blobs. I like player blobs, food blobs, and evil enemy blobs! They're a staple of my examples, so it's only fitting that blobs show up here too.
The plan is to create a player blob (blue) whose goal is to navigate to the food blob (green) as quickly as possible while avoiding the enemy blob (red).
Now, we could make this super smooth and high definition, but we already know we're going to chop it up into an observation space anyway, so we'll start with a discrete space. Somewhere between 10x10 and 20x20 will be plenty. Note that the bigger you make it, the more memory the Q-table takes up and the longer the model needs to actually learn.
So our environment will be a 20x20 grid with one player, one enemy, and one food. For now, only the player will be able to move; it will try to reach the food, which produces a reward.
Let's start with the imports we'll be using:

import numpy as np  # for array stuff and random
from PIL import Image  # for creating visual of our env
import cv2  # for showing our visual live
import matplotlib.pyplot as plt  # for graphing our mean rewards over time
import pickle  # to save/load Q-Tables
from matplotlib import style  # to make pretty charts because it matters.
import time  # using this to keep track of our saved Q-Tables.

style.use("ggplot")  # setting our style!

Next, we need to decide on the size of our environment. We'll go with a square here. As mentioned above, the size will have a huge impact on how long learning takes.
For example, in this case a 10x10 Q-table is about 15MB, while a 20x20 one is about 195MB.
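Just to get a rough sense of where numbers like that come from (this is my own back-of-the-envelope count, not something from the tutorial): each observation is a pair of deltas, each delta coordinate can take 2*SIZE - 1 values, and every observation stores four Q values.

SIZE = 10  # or 20

n_observations = (2 * SIZE - 1) ** 4  # every ((dx_food, dy_food), (dx_enemy, dy_enemy)) combo
n_values = n_observations * 4         # four Q values (one per action) per observation
print(n_observations, n_values)
# SIZE=10 -> 130,321 observations and 521,284 Q values
# SIZE=20 -> 2,313,441 observations and 9,253,764 Q values

The exact megabytes depend on Python's per-object overhead and on whether you measure the in-memory dict or the pickled file, but that steep growth in the number of entries is why the 20x20 table is so much heavier.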
To start, let's go with 10 so we can get a feel for things.

SIZE = 10

Great. Now for some other constants and a couple of variables:

HM_EPISODES = 25000
MOVE_PENALTY = 1  # feel free to tinker with these!
ENEMY_PENALTY = 300  # feel free to tinker with these!
FOOD_REWARD = 25  # feel free to tinker with these!
epsilon = 0.5  # randomness
EPS_DECAY = 0.9999  # Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 1000  # how often to play through env visually.

start_q_table = None  # if we have a pickled Q table, we'll put the filename of it here.

LEARNING_RATE = 0.1
DISCOUNT = 0.95

PLAYER_N = 1  # player key in dict
FOOD_N = 2  # food key in dict
ENEMY_N = 3  # enemy key in dict

# the dict! Using just for colors
d = {1: (255, 175, 0),  # blueish color
     2: (0, 255, 0),  # green
     3: (0, 0, 255)}  # red

These should all be self-explanatory, and where they aren't, the comments should cover them. If you don't know what things like LEARNING_RATE and DISCOUNT mean, you should start from the beginning of this series.
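One quick aside of mine (not from the tutorial itself): since epsilon is multiplied by EPS_DECAY once per episode, it's easy to sanity-check how much randomness will be left at the end of a run with the constants above.

epsilon = 0.5
EPS_DECAY = 0.9999
HM_EPISODES = 25000

final_epsilon = epsilon * EPS_DECAY ** HM_EPISODES
print(final_epsilon)  # ~0.041, so roughly 4% of actions are still random by the last episode

If you fiddle with EPS_DECAY or HM_EPISODES later, this is a handy way to make sure you aren't killing off exploration way too early (or never).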
Next up, this environment is made of blobs. These "blobs" are really just squares, but hey, I'm calling them blobs, alright?

class Blob:
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

We start by initializing these blobs at random positions. We might get an unlucky environment where the enemy and our blob, or the enemy and the food, spawn on the same "tile", so to speak. Tough luck. For debugging purposes, I'd like a string method:

    def __str__(self):
        return f"{self.x}, {self.y}"

Next, we're going to do a little operator overloading to help with our observations.
We need some sort of observation of our environment to serve as our state. I propose we simply hand our agent the relative x and y of the food and the enemy. To keep it simple, I'll override the - (subtraction) operator so we can subtract one blob from another. That method looks like:

    def __sub__(self, other):
        return (self.x-other.x, self.y-other.y)

Here, other is any other Blob-type object (or really anything with x and y attributes!).
Next, I'll add an "action" method that moves the blob based on a "discrete" action that gets passed in.

    def action(self, choice):
        '''
        Gives us 4 total movement options. (0,1,2,3)
        '''
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)

Now we just need the move method:

    def move(self, x=False, y=False):

        # If no value for x, move randomly
        if not x:
            self.x += np.random.randint(-1, 2)
        else:
            self.x += x

        # If no value for y, move randomly
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y


        # If we are out of bounds, fix!
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE-1:
            self.x = SIZE-1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE-1:
            self.y = SIZE-1

The comments and code should be easy enough to follow. If no value is given for x/y, we just move randomly; otherwise we move by whatever was requested. Then, at the end, we rein in any attempt by the blob to wander out of bounds.
Now we can test what we've got so far. For example:

player = Blob()
food = Blob()
enemy = Blob()


print(player)
print(food)
print(player-food)
player.move()
print(player-food)
player.action(2)
print(player-food)

The output:

7, 5
8, 1
(-1, 4)
(-2, 5)
(-3, 6)

Everything looks good so far! Now let's create the q_table:

if start_q_table is None:
    # initialize the q-table#
    q_table = {}
    for i in range(-SIZE+1, SIZE):
        for ii in range(-SIZE+1, SIZE):
            for iii in range(-SIZE+1, SIZE):
                for iiii in range(-SIZE+1, SIZE):
                    q_table[((i, ii), (iii, iiii))] = [np.random.uniform(-5, 0) for i in range(4)]

This isn't the most efficient code, but it covers all of our bases. Even though this table is already quite large for Q-learning, Python can still generate one this size quickly. It takes about 2 seconds for me, and it's an operation we only need to run once, so that's fine by me. Feel free to improve it!
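For example, one possible cleanup (just a sketch; it should build the exact same table as the nested loops above) is to let itertools.product handle the four nested ranges:

import itertools

if start_q_table is None:
    deltas = range(-SIZE + 1, SIZE)  # every possible x or y difference between two blobs
    q_table = {((x1, y1), (x2, y2)): [np.random.uniform(-5, 0) for _ in range(4)]
               for x1, y1, x2, y2 in itertools.product(deltas, repeat=4)}

Same keys, same random starting values, just fewer indentation levels.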
Once we've actually trained for a while, though, we may have a saved Q-table, so we also handle the case where we've set a filename for one:

else:
    with open(start_q_table, "rb") as f:
        q_table = pickle.load(f)

Note that to look something up in your Q-table, you can do:

print(q_table[((-9, -2), (3, 9))])
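
What comes back is the list of four Q values for that observation, one per action (0-3). If you want to see which action the agent would pick greedily from it, that's just the argmax:

obs = ((-9, -2), (3, 9))
print(q_table[obs])             # four Q values, one for each action 0-3
print(np.argmax(q_table[obs]))  # index of the action with the highest Q value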

Alright, we're ready to start iterating over episodes!

episode_rewards = []

for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()

We'll track episode rewards, and how they improve over time, with the episode_rewards list. For every new episode, we re-initialize the player, food, and enemy objects. Next, let's handle the logic for whether or not to visualize:

    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

Now we start the actual frames/steps of the episode:

    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)

I'll also point out that this is where we could choose to move the other objects:

        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############

I suspect that moving these things would just hinder training at the start. My expectation is that we could turn their movement on after training for quite a while, or maybe only after training is finished. The same algorithm that was trained with the food and enemy standing still should still work when they move, since the agent only ever sees relative positions. I just think having them move from the beginning would confuse the algorithm while it's learning.
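If you do want to experiment with that "train stationary first, then turn movement on" idea, one possible way to phase it in (my own tweak, not part of the tutorial code, and MOVE_OBJECTS_AFTER is a made-up constant) would be:

MOVE_OBJECTS_AFTER = 20000  # hypothetical: episodes of stationary training before things start moving

# inside the step loop, right after player.action(action):
if episode > MOVE_OBJECTS_AFTER:
    enemy.move()
    food.move()

For now, though, we'll leave those lines commented out and let only the player move.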
Now let's handle the rewards:

        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY

Once we have the reward information, we can grab the Q-table and Q-value information we need:

        new_obs = (player-food, player-enemy)  # new observation
        max_future_q = np.max(q_table[new_obs])  # max Q value for this new obs
        current_q = q_table[obs][action]  # current Q for our chosen action

With those values, we can do our calculation:

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q  # update the Q-table with the new value for the action we took
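
Just to make the update rule concrete, here's a quick worked example with made-up numbers (my own illustration, not part of the tutorial code):

LEARNING_RATE = 0.1
DISCOUNT = 0.95

current_q = -2.0      # say this was the table's value for (obs, action)
max_future_q = -1.0   # best Q value available from the new observation
reward = -1           # an ordinary step, i.e. -MOVE_PENALTY

new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
print(new_q)  # 0.9*(-2.0) + 0.1*(-1 + 0.95*(-1.0)) = -1.995

So each step only nudges the stored value a little toward reward + DISCOUNT * max_future_q, which is exactly what LEARNING_RATE controls.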

We don't strictly need to visualize this environment, but it helps us make sure we haven't made any mistakes.
Who am I kidding, we just want to see our masterpiece!
...but we don't want to watch every single episode, because there are going to be a lot of them.

        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

The comments should explain the code above. Next, we tally up our reward:

        episode_reward += reward

And if we reached the food or hit the enemy, we end the episode:

        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break

With all of that done, we come back out an indentation level and finish off our per-episode loop:

    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

The whole episode loop looks like this:

episode_rewards = []

for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)

        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############

        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY
        ## NOW WE KNOW THE REWARD, LET'S CALC YO
        # first we need to obs immediately after the move.
        new_obs = (player-food, player-enemy)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        episode_reward += reward
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break

    #print(episode_reward)
    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

Now we can finally plot and save:

moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,))/SHOW_EVERY, mode='valid')

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"Reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)
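
In case the np.convolve line above looks mysterious: convolving the rewards with a window of SHOW_EVERY ones divided by SHOW_EVERY is just a rolling mean. A tiny sketch of the idea:

rewards = [1, 2, 3, 4, 5]
window = 3
print(np.convolve(rewards, np.ones((window,)) / window, mode='valid'))  # [2. 3. 4.]

Each output value is the average of one consecutive window of rewards, which is why the plot looks so much smoother than the raw per-episode numbers.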

The full code is:

import numpy as np
from PIL import Image
import cv2
import matplotlib.pyplot as plt
import pickle
from matplotlib import style
import time

style.use("ggplot")

SIZE = 10

HM_EPISODES = 25000
MOVE_PENALTY = 1
ENEMY_PENALTY = 300
FOOD_REWARD = 25
epsilon = 0.9
EPS_DECAY = 0.9998  # Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 3000  # how often to play through env visually.

start_q_table = None # None or Filename

LEARNING_RATE = 0.1
DISCOUNT = 0.95

PLAYER_N = 1  # player key in dict
FOOD_N = 2  # food key in dict
ENEMY_N = 3  # enemy key in dict

# the dict!
d = {1: (255, 175, 0),
     2: (0, 255, 0),
     3: (0, 0, 255)}


class Blob:
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

    def __str__(self):
        return f"{self.x}, {self.y}"

    def __sub__(self, other):
        return (self.x-other.x, self.y-other.y)

    def action(self, choice):
        '''
        Gives us 4 total movement options. (0,1,2,3)
        '''
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)

    def move(self, x=False, y=False):

        # If no value for x, move randomly
        if not x:
            self.x += np.random.randint(-1, 2)
        else:
            self.x += x

        # If no value for y, move randomly
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y


        # If we are out of bounds, fix!
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE-1:
            self.x = SIZE-1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE-1:
            self.y = SIZE-1


if start_q_table is None:
    # initialize the q-table#
    q_table = {}
    for i in range(-SIZE+1, SIZE):
        for ii in range(-SIZE+1, SIZE):
            for iii in range(-SIZE+1, SIZE):
                for iiii in range(-SIZE+1, SIZE):
                    q_table[((i, ii), (iii, iiii))] = [np.random.uniform(-5, 0) for i in range(4)]

else:
    with open(start_q_table, "rb") as f:
        q_table = pickle.load(f)


# can look up from Q-table with: print(q_table[((-9, -2), (3, 9))]) for example

episode_rewards = []

for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)

        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############

        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY
        ## NOW WE KNOW THE REWARD, LET'S CALC YO
        # first we need to obs immediately after the move.
        new_obs = (player-food, player-enemy)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        episode_reward += reward
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break

    #print(episode_reward)
    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,))/SHOW_EVERY, mode='valid')

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"Reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)

You should see a handful of example run-throughs, and then get a graph at the end that looks something like this:
[Graph: moving average of episode rewards over training]
After you close that graph, the Q-table gets saved, with a timestamp in the filename. Now we can load that table and either just play, keep learning, or both. For example, we can change SHOW_EVERY to 1:

SHOW_EVERY = 1

Then change epsilon to 0:

epsilon = 0.0
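
Putting that together, the "playback" settings at the top of the script look roughly like this (use whatever filename your own run saved, not mine):

start_q_table = "qtable-1559485134.pickle"  # your saved table's filename
epsilon = 0.0    # no random moves; just exploit the table
SHOW_EVERY = 1   # render every episode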

Then we can watch something like this:
https://pythonprogramming.net/static/images/reinforcement-learning/10x10-initial.mp4
Not the brightest thing I've ever seen. Let's keep training. First, we update:

start_q_table = "qtable-1559485134.pickle" # None or Filename

Make sure the timestamp is your own; don't just copy mine. I'm going to set:

HM_EPISODES = 50000

Then bump epsilon back up to 1:

epsilon = 1.0

Then change SHOW_EVERY to:

SHOW_EVERY = 5000  # how often to play through env visually.

[Graph: moving average of episode rewards after continued training]
Okay, load in the new Q-table, set epsilon to 0 and SHOW_EVERY to 1, and let's see how we did:
https://pythonprogramming.net/static/images/reinforcement-learning/10x10-75k-episodes.mp4
Alright! I'd say we've got it figured out. Looking slick!
...but what about movement? Let's try turning it on:

        #### Movement for food and enemy ###
        enemy.move()
        food.move()
        ##############

These lines go inside the step loop (the for i in range(200) loop).

https://pythonprogramming.net/static/images/reinforcement-learning/10x10-75k-episodes-move-on.mp4
So you can see it actually works with movement too, right out of the box, because the model was really only ever trained to move based on relative deltas. The one thing the model isn't prepared for, though, is that, when the enemy is close, the enemy and the player can in theory move onto the same square. During all of training, the player could always safely move around the enemy. With movement turned on, that's no longer true, so you'd probably still want to keep training from here with movement enabled, and maybe you'd see some new behaviors emerge from the player.
I was curious how different going from 10x10 to 20x20 would be. As we saw, the 10x10 took somewhere between 25K and 75K episodes to learn. I found that a 20x20 model took about 2.5 million episodes to learn, which tracks with the observation space being roughly 18x larger (39**4, about 2.3M observations, versus 19**4, about 130K).
Luckily, each episode doesn't take long to train, so it wasn't a huge deal. If you're curious, here are the results:
Enemy/food not moving:
https://pythonprogramming.net/static/images/reinforcement-learning/20x20-no-move-2.mp4
Enemy and food moving:
https://pythonprogramming.net/static/images/reinforcement-learning/20x20-yes-move.mp4
After I shared this code with Daniel, he suggested that the blob should only be able to reach the food about half the time, since the moves are diagonal only. Which is actually... a great point. It wasn't something I had considered when writing the code; I was just trying to keep the action space small.
If the food's x is even and the player's x is odd, or the same mismatch happens on y, then the player can't move directly onto the food.
...unless we use a wall. The catch is that the player has no concept of a wall, or even of where it is in the environment at all. The only thing the player knows is the relative position of the food and the enemy. So even a subtle trick like needing to use a wall requires the agent to consistently slide up/down/left/right along it to make up for the even/odd mismatch in x or y. I'm honestly shocked the agent figured this out. The success rate, both with and without movement, is close to perfect, despite this fundamental problem that I didn't expect Q-learning to handle so easily.
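To make the parity point concrete, here's a small check (my own aside, not tutorial code): a diagonal move changes x by +/-1 and y by +/-1, so x + y always changes by -2, 0, or +2, and its parity never changes unless a wall clamp alters only one coordinate.

def reachable_without_walls(player, food):
    # each diagonal move flips the parity of x and of y together, so the parity of (x + y)
    # is invariant; only a wall clamp (which changes just one coordinate) can break it
    return (player.x + player.y) % 2 == (food.x + food.y) % 2

Without walls, about half of random spawns fail that check, which is exactly Daniel's "half the time" observation; the clamping at the grid edges in move() is what lets the agent break the parity when it needs to.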
I encourage you to tinker with this environment a lot more. There's plenty to adjust: the size, the rewards and penalties, the possible actions, and even the observations themselves.
