Welcome to part 4 of the Reinforcement Learning series, and to the Q-learning part of it. In this part, we're going to wrap up basic Q-Learning by making our own learning environment. I didn't initially intend for this to be a tutorial, it was just something I personally wanted to do, but, after many requests, it only made sense to do it as a tutorial!
If you've followed my tutorials over the years, you know I like blobs. I like player blobs, food blobs, and evil enemy blobs! That matters in my examples, and it's only fitting that blobs make an appearance here too.
Our plan is to create a player blob (blue), whose objective is to navigate to the food blob (green) as quickly as possible, while avoiding the enemy blob (red).
Now, we could make this high-definition and super smooth, but we already know we're going to chop it up into an observation space anyway, so we'll start with a discrete space. Something between 10x10 and 20x20 should be plenty. Note that the bigger you make it, the larger the Q-table will be in memory, and the longer it will take the model to actually learn.
So our environment will be a grid, up to 20x20, on which we have a player, an enemy, and a food. For now, we'll only let the player move, trying to reach the food, which yields a reward.
Let's start with the imports we'll be using:
import numpy as np # for array stuff and random
from PIL import Image # for creating visual of our env
import cv2 # for showing our visual live
import matplotlib.pyplot as plt # for graphing our mean rewards over time
import pickle # to save/load Q-Tables
from matplotlib import style # to make pretty charts because it matters.
import time # using this to keep track of our saved Q-Tables.
style.use("ggplot") # setting our style!
Next, we need to decide on the size of the environment. We'll make it a square grid here. As mentioned, size will have a huge impact on our learning time.

For example, in this case, a 10x10 Q-table is ~15MB. A 20x20 one is ~195MB.

To start, let's go with 10 so we can get a feel for things.
SIZE = 10
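To see why size matters so much, here's a quick back-of-envelope sketch. Each observation will be ((dx1, dy1), (dx2, dy2)), where every delta lies in [-(SIZE-1), SIZE-1], i.e. 2*SIZE-1 possible values apiece, and every state holds 4 Q-values:

```python
# Back-of-envelope estimate of how the Q-table grows with SIZE.
for size in (10, 20):
    deltas = 2 * size - 1          # possible values for each of the 4 deltas
    states = deltas ** 4           # all (player-food, player-enemy) combos
    q_values = states * 4          # 4 actions per state
    print(size, states, q_values)  # 10 -> 130321 states, 20 -> 2313441
```

Going from 10 to 20 multiplies the state count by roughly 18x, which lines up with the ~15MB vs ~195MB figures above.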
Great, now for some other constants and a few variables:
HM_EPISODES = 25000
MOVE_PENALTY = 1 # feel free to tinker with these!
ENEMY_PENALTY = 300 # feel free to tinker with these!
FOOD_REWARD = 25 # feel free to tinker with these!
epsilon = 0.5 # randomness
EPS_DECAY = 0.9999 # Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 1000 # how often to play through env visually.
start_q_table = None # if we have a pickled Q table, we'll put the filename of it here.
LEARNING_RATE = 0.1
DISCOUNT = 0.95
PLAYER_N = 1 # player key in dict
FOOD_N = 2 # food key in dict
ENEMY_N = 3 # enemy key in dict
# the dict! Using just for colors
d = {1: (255, 175, 0),  # blueish color
     2: (0, 255, 0),  # green
     3: (0, 0, 255)}  # red
These should all be fairly self-explanatory, and where they're not, the comments should cover it. If you don't know what things like LEARNING_RATE or DISCOUNT mean, then you should start from the beginning of this series.
Next up, this environment consists of blobs. These "blobs" are really just squares, but, I'm calling them blobs, okay?
class Blob:
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)
We start off by initializing these blobs randomly. We might get an unlucky environment where the enemy and our blob, or the enemy and the food, start on the same "tile," so to speak. Tough luck. For debugging purposes, I'd like a string method:
    def __str__(self):
        return f"{self.x}, {self.y}"
Next, we're going to do some operator overloading to help with our observations.

We need some sort of observation of our environment to serve as our state. What I propose is that we simply hand our agent the relative x and y of the food and the enemy. To keep things simple, I'm going to override the - (subtraction) operator, so we can subtract one blob from another. That method looks like:
    def __sub__(self, other):
        return (self.x-other.x, self.y-other.y)
other is any other Blob-type object (or really anything with x and y attributes!).
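Here's a minimal standalone check of what that overload buys us (coordinates picked by hand for the demo, not the random ones from __init__):

```python
# Minimal Blob just to demonstrate the __sub__ overload.
class Blob:
    def __init__(self, x, y):  # fixed coords here purely for the demo
        self.x = x
        self.y = y

    def __sub__(self, other):
        return (self.x - other.x, self.y - other.y)

player = Blob(7, 5)
food = Blob(8, 1)
print(player - food)  # (-1, 4): the relative offset we'll feed the agent
```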
Next, I'll add an "action" method, which will move based on a "discrete" action that gets passed in.
    def action(self, choice):
        '''
        Gives us 4 total movement options. (0,1,2,3)
        '''
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)
Now, we just need the move method:
    def move(self, x=False, y=False):
        # If no value for x, move randomly
        if not x:
            self.x += np.random.randint(-1, 2)
        else:
            self.x += x

        # If no value for y, move randomly
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y

        # If we are out of bounds, fix!
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE-1:
            self.x = SIZE-1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE-1:
            self.y = SIZE-1
The comments and code should be fairly easy to follow. If there's no value for x or y, we just move randomly, otherwise we move as requested. Then, at the end, we handle the blob's attempts to go out of bounds.
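The wall clamping can be sanity-checked on its own with a trimmed-down version of the logic (a standalone helper function here, not the actual method):

```python
SIZE = 10

# Stripped-down version of the out-of-bounds handling from move().
def clamp_move(x, y, dx, dy):
    x, y = x + dx, y + dy
    x = max(0, min(SIZE - 1, x))  # same effect as the if/elif pairs above
    y = max(0, min(SIZE - 1, y))
    return x, y

print(clamp_move(0, 9, -1, 1))  # (0, 9): pushed past two walls, clamped back
print(clamp_move(5, 5, 1, -1))  # (6, 4): a normal diagonal step
```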
Now, we can test what we've got so far. For example:
player = Blob()
food = Blob()
enemy = Blob()
print(player)
print(food)
print(player-food)
player.move()
print(player-food)
player.action(2)
print(player-food)
The output is:
7, 5
8, 1
(-1, 4)
(-2, 5)
(-3, 6)
Everything checks out so far! Now let's create the q_table:
if start_q_table is None:
    # initialize the q-table
    q_table = {}
    for i in range(-SIZE+1, SIZE):
        for ii in range(-SIZE+1, SIZE):
            for iii in range(-SIZE+1, SIZE):
                for iiii in range(-SIZE+1, SIZE):
                    q_table[((i, ii), (iii, iiii))] = [np.random.uniform(-5, 0) for i in range(4)]
This isn't the most efficient code, but it should cover all of our bases. Despite this table being quite large for Q-Learning, Python can still generate a table this big quickly. It takes about 2 seconds for me, and it's an operation that only needs to run once, so that's fine by me. Feel free to improve on it!
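One possible cleanup, sketched with itertools.product (it builds the exact same keys and values as the four nested loops, just more compactly):

```python
import itertools
import numpy as np

SIZE = 10
span = range(-SIZE + 1, SIZE)

# Same table as the nested-loop version, one dict comprehension.
q_table = {
    ((i, ii), (iii, iiii)): [np.random.uniform(-5, 0) for _ in range(4)]
    for i, ii, iii, iiii in itertools.product(span, repeat=4)
}
print(len(q_table))  # (2*SIZE-1)**4 = 130321 entries for SIZE=10
```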
Once we get into training, however, we might actually have a saved Q-table already, so we'll handle the case where we've set a filename for one:
else:
    with open(start_q_table, "rb") as f:
        q_table = pickle.load(f)
Of note, to consult your Q-table, you can do something like:
print(q_table[((-9, -2), (3, 9))])
Alright, we're ready to begin iterating over episodes!
episode_rewards = []

for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()
We'll track episode rewards with episode_rewards, to see how they improve over time. For every new episode, we re-initialize the player, food, and enemy objects. Next, let's handle the logic of whether or not to visualize:
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False
Now we begin the actual frames/steps of the episode:
    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)
I'll also point out that this is where we could choose to move the other objects:
        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############
I suspect moving these things would hurt training at the start. My expectation is that we'd probably move them after having trained for quite a while, or maybe only after training is done entirely. The same algorithm that trained against a stationary food/enemy should still work once they move. I just think moving them would confuse the algorithm while it's learning.

Now we handle the reward:
        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY
Once we have the reward information, we can grab our Q-table and Q-value info:
        new_obs = (player-food, player-enemy)  # new observation
        max_future_q = np.max(q_table[new_obs])  # max Q value for this new obs
        current_q = q_table[obs][action]  # current Q for our chosen action
With those values, we can do our calculation:
        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q
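To make the update concrete, here's one pass of that formula with the constants from above and made-up Q values (the -2.0 and -1.0 are hypothetical, just to see which way the value gets nudged):

```python
# One Q-learning update by hand, with made-up starting values.
LEARNING_RATE = 0.1
DISCOUNT = 0.95

current_q = -2.0      # hypothetical current Q for the chosen action
max_future_q = -1.0   # hypothetical best Q from the new observation
reward = -1           # -MOVE_PENALTY: a normal step

new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
print(new_q)  # ~ -1.995: nudged from -2.0 toward the discounted target of -1.95
```

So each step only moves the stored value 10% of the way toward the target, which is what keeps the learning stable.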
While we don't really need to visualize this environment, it can help us make sure we haven't made any mistakes.

Who am I kidding, we just want to see our masterpiece!!

...but we don't want to see it every episode, since we're going to have a LOT of them.
        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
The comments should explain the code above. Next, we handle our reward:
        episode_reward += reward
If we reach the food or hit the enemy, we're done:
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break
After all of that, we append to our tracking list, decay epsilon, and close out our per-episode loop:
    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY
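Multiplicative decay shrinks epsilon surprisingly slowly at first and then meaningfully by the end. With the values defined earlier (epsilon = 0.5, EPS_DECAY = 0.9999), a quick sketch of where it lands:

```python
# Where epsilon ends up after n episodes of multiplicative decay.
epsilon = 0.5
EPS_DECAY = 0.9999

for n in (1000, 10000, 25000):
    print(n, epsilon * EPS_DECAY ** n)  # ~0.45, ~0.18, ~0.04
```

So by the final episodes the agent is almost always exploiting its Q-table rather than exploring.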
The whole episode loop looks like:
episode_rewards = []
for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)

        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############

        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY

        ## NOW WE KNOW THE REWARD, LET'S CALC YO
        # first we need to obs immediately after the move.
        new_obs = (player-food, player-enemy)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        episode_reward += reward
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break

    #print(episode_reward)
    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY
Now we can finally graph and save:
moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,))/SHOW_EVERY, mode='valid')

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"Reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)
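In case the np.convolve call looks mysterious: convolving with a window of 1/window values is just a sliding-window mean, which is what smooths our noisy per-episode rewards into a readable curve. A tiny example with a window of 2:

```python
import numpy as np

# What the convolve line computes: a simple moving average.
rewards = [1, 2, 3, 4, 5]
window = 2
moving_avg = np.convolve(rewards, np.ones(window) / window, mode='valid')
print(moving_avg)  # [1.5 2.5 3.5 4.5]: the mean of each length-2 window
```

mode='valid' keeps only the positions where the window fully overlaps the data, which is why moving_avg is a bit shorter than episode_rewards.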
The full code is:
import numpy as np
from PIL import Image
import cv2
import matplotlib.pyplot as plt
import pickle
from matplotlib import style
import time

style.use("ggplot")

SIZE = 10

HM_EPISODES = 25000
MOVE_PENALTY = 1
ENEMY_PENALTY = 300
FOOD_REWARD = 25
epsilon = 0.9
EPS_DECAY = 0.9998  # Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 3000  # how often to play through env visually.

start_q_table = None  # None or Filename

LEARNING_RATE = 0.1
DISCOUNT = 0.95

PLAYER_N = 1  # player key in dict
FOOD_N = 2  # food key in dict
ENEMY_N = 3  # enemy key in dict

# the dict!
d = {1: (255, 175, 0),
     2: (0, 255, 0),
     3: (0, 0, 255)}


class Blob:
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

    def __str__(self):
        return f"{self.x}, {self.y}"

    def __sub__(self, other):
        return (self.x-other.x, self.y-other.y)

    def action(self, choice):
        '''
        Gives us 4 total movement options. (0,1,2,3)
        '''
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)

    def move(self, x=False, y=False):
        # If no value for x, move randomly
        if not x:
            self.x += np.random.randint(-1, 2)
        else:
            self.x += x

        # If no value for y, move randomly
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y

        # If we are out of bounds, fix!
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE-1:
            self.x = SIZE-1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE-1:
            self.y = SIZE-1


if start_q_table is None:
    # initialize the q-table
    q_table = {}
    for i in range(-SIZE+1, SIZE):
        for ii in range(-SIZE+1, SIZE):
            for iii in range(-SIZE+1, SIZE):
                for iiii in range(-SIZE+1, SIZE):
                    q_table[((i, ii), (iii, iiii))] = [np.random.uniform(-5, 0) for i in range(4)]
else:
    with open(start_q_table, "rb") as f:
        q_table = pickle.load(f)

# can look up from Q-table with: print(q_table[((-9, -2), (3, 9))]) for example

episode_rewards = []
for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0
    for i in range(200):
        obs = (player-food, player-enemy)
        #print(obs)
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)
        # Take the action!
        player.action(action)

        #### MAYBE ###
        #enemy.move()
        #food.move()
        ##############

        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY

        ## NOW WE KNOW THE REWARD, LET'S CALC YO
        # first we need to obs immediately after the move.
        new_obs = (player-food, player-enemy)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # starts an rgb of our size
            env[food.x][food.y] = d[FOOD_N]  # sets the food location tile to green color
            env[player.x][player.y] = d[PLAYER_N]  # sets the player tile to blue
            env[enemy.x][enemy.y] = d[ENEMY_N]  # sets the enemy location to red
            img = Image.fromarray(env, 'RGB')  # reading to rgb. Apparently. Even tho color definitions are bgr. ???
            img = img.resize((300, 300))  # resizing so we can see our agent in all its glory.
            cv2.imshow("image", np.array(img))  # show it!
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:  # crummy code to hang at the end if we reach abrupt end for good reasons or not.
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        episode_reward += reward
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break

    #print(episode_reward)
    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,))/SHOW_EVERY, mode='valid')

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"Reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)
You should see some episodes play out, and then get a graph at the end, something like:

After closing that graph, the Q-table will be saved, with a timestamp in the filename. Now, we can load that table and play, learn, or both. For example, we can change SHOW_EVERY to 1:
SHOW_EVERY = 1
and epsilon to 0:
epsilon = 0.0
Then we can see results like:
https://pythonprogramming.net/static/images/reinforcement-learning/10x10-initial.mp4
Not the smartest thing I've ever seen. Let's keep training. First, we update:
start_q_table = "qtable-1559485134.pickle" # None or Filename
Make sure your timestamp is correct; don't just copy mine. I'll set:
HM_EPISODES = 50000
Then set epsilon back to 1:
epsilon = 1.0
and SHOW_EVERY to:
SHOW_EVERY = 5000 # how often to play through env visually.
Okay, after that training run, load the newly-saved Q-table, set epsilon to 0 and SHOW_EVERY to 1, and let's see how we did.
https://pythonprogramming.net/static/images/reinforcement-learning/10x10-75k-episodes.mp4
Alright! I'd say we've got this solved. Looking slick!

...but what about movement? Let's try turning it on:
        #### Movement for food and enemy ###
        enemy.move()
        food.move()
        ##############
inside the step loop (the for i in range(200) one).
https://pythonprogramming.net/static/images/reinforcement-learning/10x10-75k-episodes-move-on.mp4
So you can see it actually works with movement right out of the box, because the model was really just trained to move based on the relative-position deltas. One thing this model isn't prepared for, though, is that with the enemy close by, the enemy and the player can, in theory, move onto the same tile. In all of its training, the player could always safely move around the enemy. With movement on, that's no longer the case, so you'd probably still want to do some training from here with movement enabled, and maybe you'd see some new player abilities emerge.
I wanted to see how much of a difference going from 10x10 to 20x20 would make. As we saw, the 10x10 took somewhere between 25K and 75K episodes of training to learn. I found that a 20x20 model took 2.5 million episodes to learn.

Luckily, individual episodes don't take long at all to train, so this wasn't a huge deal. In case you're curious, here are the results:

Enemy/food cannot move:
https://pythonprogramming.net/static/images/reinforcement-learning/20x20-no-move-2.mp4
Enemy and food can move:
https://pythonprogramming.net/static/images/reinforcement-learning/20x20-yes-move.mp4
After I shared this code with Daniel, he suggested that the blob should only be able to reach the food about half the time, since movement can only be diagonal. Which is actually... a great point. I hadn't considered that when writing the code; I was just trying to keep the action space small.

If the food's x is even and the player's x is odd... or the same for y, then it should be impossible to move directly onto the food.

...unless we use a wall. The thing is, the player has no concept of a wall, or even of where it is in the environment. All the player knows is the relative position of the food and the enemy. So even a subtle property like needing to use the walls, where the agent has to keep pressing up/down or left/right against one to make up for the even/odd mismatch in x or y, is something I'm honestly shocked the agent figured out. The success rate was nearly perfect, both with and without movement, despite this underlying issue, which I didn't expect Q-Learning to handle so gracefully.
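Daniel's point can be checked directly (this is a standalone sketch, not part of the training code): each of the four diagonal actions changes x + y by -2, 0, or +2, so away from the walls the parity of x + y can never change, and only getting clamped against a wall (which shifts just one coordinate) can break the pattern.

```python
# Standalone check of the parity argument: each diagonal action changes
# x + y by -2, 0, or +2, so away from the walls the parity of (x + y)
# is invariant -- only a wall clamp, which moves one coordinate, breaks it.
MOVES = [(1, 1), (-1, -1), (-1, 1), (1, -1)]  # the four actions

x, y = 4, 7  # arbitrary start, away from any wall
for dx, dy in MOVES:
    assert (x + dx + y + dy) % 2 == (x + y) % 2
print("all four moves preserve the parity of x + y")
```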
I encourage you to tinker with this environment some more. There's a lot here we could tweak: size, rewards, penalties, the possible actions, and even the observations themselves.