Beatris: An Evil Tetris AI

This project was built by Amogh Agnihotri, Reese Costis, Prajakta Joshi, Rebecca Phung, Suhas Raja, and Ramya Rajasekaran. Play against Beatris and find our scripts and models here.

Background

Tetris was first created in 1984 by Soviet software engineer Alexey Pajitnov and was the first game exported from the Soviet Union to the United States. Since its release on the Game Boy in 1989, Tetris has been widely regarded as one of the greatest video games of all time and is now the second most downloaded game of all time, behind Minecraft.

The game itself is a tile-matching puzzle video game in which different geometric shapes descend from the top of the screen. Points are accumulated when an entire row of blocks is filled, at which point the row is cleared. If the player cannot clear rows quickly enough, the screen fills with blocks and the game ends. The game never ends in player victory; the player can only clear as many lines as possible before an inevitable loss.

Tetris is a popular game in the mainstream and in the software development community. Many AIs have been created to beat the game. However, far fewer AIs have been made to make the game as hard as possible for the player. Our goal for this project was to create a malicious AI that chooses the hardest possible pieces to give to a smart Tetris AI playing the game. We’ll call the smart Tetris AI the “agent” and the malicious game AI “Beat Tetris,” or “Beatris” for short. We chose an agent implementation from GitHub user nuno-faria. You can find the implementation code here. His AI uses deep reinforcement learning and q-learning to learn how to beat normal Tetris within 2000 games. Using organic game results, we aimed to train and improve the agent and subsequently train Beatris to give the agent the worst time, and score, possible.

The Agent — How it Works

We first had to train the agent against the normal Tetris game to create a formidable opponent for Beatris. Here’s an explanation of its training process.

At a High Level

Let’s start with the highest-level explanation. The agent starts by making random choices in-game. It learns by studying the state of the board and receiving a varying numerical reward based on the move it made. At the end of every episode, it uses a neural network to train itself, feeding the moves made and the rewards received into a q-learning algorithm in an attempt to maximize reward. Before we dive into the specifics of our agent’s training process, we first need to understand q-learning.

Q-Learning

Q-learning is a learning algorithm that tries to find the best action in response to the current state. In our case, an action is a placement of the given piece and the state is the current board’s layout. The agent finds the right actions to take by acting randomly at first, receiving a total reward at the end of its actions, and then attempting to maximize that reward.

The Tetris screen with arrows representing actions and the state of the field on display.

During training, a q-table is created. The q-table’s axes are state and action and all values are initialized the first time a state and corresponding action are chosen. The q-table becomes a reference for the agent to select the best action for a state based on the corresponding q-value.

The agent can take action in two ways. It can use the q-table, looking at the q-value of every possible action for the current state and taking the action corresponding to the highest q-value, or it can take a random action. Selecting a random action allows the agent to find new states that it might never reach by following the existing q-table.
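
A minimal sketch of this epsilon-greedy choice, assuming a hypothetical `q_values(state, action)` lookup and a list of candidate placements:

```python
import random

def choose_action(state, possible_actions, q_values, epsilon):
    """Epsilon-greedy selection: with probability epsilon take a random
    placement (explore); otherwise take the placement with the highest
    known q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(possible_actions)
    return max(possible_actions, key=lambda action: q_values(state, action))
```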

In our case, the q-table is updated after every finished game with the following equation: Q_state = reward + discount × Q_next_state, where discount is a multiplier between 0.8 and 0.9 that down-weights the future value of the next state’s rewards. We made the reward dependent on (lines cleared)² and bumpiness.
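
For illustration, the same update written against a plain dictionary (the real agent approximates the table with a neural network, so this is a simplification):

```python
DISCOUNT = 0.85  # between 0.8 and 0.9, as described above

def update_q(q_table, state, action, reward, best_next_q):
    """One q-table update: Q_state = reward + discount * Q_next_state,
    where best_next_q is the highest q-value reachable from the state
    that resulted from taking `action`."""
    q_table[(state, action)] = reward + DISCOUNT * best_next_q
```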

After training is done, the q-table is complete and the agent can use it in future games to respond to all possible states. For example, knowing the given piece and the state of the board, the agent will try every possible placement of the piece from the given state, reference the q-value of every resulting state, and finally take the action corresponding to the highest q-value. If you want a more in-depth explanation of q-learning before you continue, check out this article.
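
Sketched as a function, where `next_state` and `q_value` stand in for the agent’s board simulator and learned value estimate (both names are ours, not from the original code):

```python
def best_action(state, possible_actions, next_state, q_value):
    """Simulate every candidate placement with the `next_state` helper,
    score each resulting board with the learned `q_value` function, and
    return the placement whose outcome scores highest."""
    return max(possible_actions, key=lambda a: q_value(next_state(state, a)))
```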

Game States

For our application of q-learning, the agent evaluates the state of the board using four heuristics (think of them as features) that it trains its behavior on. Those are (one way of computing each is sketched after the list):

  1. Lines cleared
  2. Number of holes (empty spaces completely surrounded by already played pieces)
  3. Bumpiness (the sum of the differences between the heights of adjacent columns)
  4. Total height
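
Here is one way these features could be computed, assuming the board is a 2D grid of 0s and 1s with row 0 at the top (the actual implementation in nuno-faria’s repository may differ in details, e.g. it uses a common looser definition of a hole):

```python
def column_heights(board):
    """Height of each column, counted from the first filled cell downward."""
    heights = []
    rows, cols = len(board), len(board[0])
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r
                break
        heights.append(h)
    return heights

def count_holes(board):
    """Holes, using the common definition: empty cells with at least one
    filled cell somewhere above them in the same column."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes

def bumpiness(board):
    """Sum of absolute height differences between adjacent columns."""
    h = column_heights(board)
    return sum(abs(h[i] - h[i + 1]) for i in range(len(h) - 1))

def total_height(board):
    """Sum of all column heights."""
    return sum(column_heights(board))
```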

Agent Hyperparameters

The agent also takes in a couple of hyperparameters: episodes and epsilon_stop_episode. The number of episodes determines how many games the agent will play to train. The epsilon_stop_episode parameter determines at which episode, or game attempt, the agent should stop making any random decisions for the rest of the training episodes. To control this rate of random decision making, the agent keeps track of an epsilon value. Epsilon is a value between 0 and 1 that determines what fraction of the agent’s piece-placement decisions are random rather than deliberate: at 1, every decision is random; at 0, every decision is deliberate. Epsilon starts at 1 and is uniformly decremented every episode from the first episode to the stop episode, so that the agent makes increasingly deliberate decisions as it trains. For example, if episodes is 1000 and epsilon_stop_episode is 800, then at episode 1 the agent will make all of its decisions randomly (epsilon = 1), and starting at episode 800 it will make all of its decisions deliberately (epsilon = 0). The random-choice aspect allows the agent to keep discovering new actions at different states that may perform even better than its prior knowledge suggests.
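
A sketch of that linear schedule (the function name and the exact episode numbering are illustrative):

```python
def epsilon_for_episode(episode, epsilon_stop_episode, eps_start=1.0, eps_end=0.0):
    """Linearly decay epsilon from eps_start to eps_end over the first
    epsilon_stop_episode games, then hold it at eps_end."""
    if episode >= epsilon_stop_episode:
        return eps_end
    return eps_start - (eps_start - eps_end) * (episode / epsilon_stop_episode)
```

With epsilon_stop_episode = 800 this gives 1.0 at episode 0, 0.5 at episode 400, and 0.0 from episode 800 onward.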

So now we know how the agent evaluates the playing field state and what hyperparameters we can use to influence the training. But how does the agent evaluate its own performance in order to improve? That’s where the scoring function comes in.

Scoring

Initially, we had one scoring function that doubled as both the reward (the training score the agent aims to maximize) and the metric we used to evaluate the agent’s performance. That scoring function was primarily based on (lines cleared)² × board_width, where lines cleared measures how many lines were cleared by the most recent piece placement. This worked fine at first, when we were only tuning state features and hyperparameters to improve agent performance. However, we found that the most significant changes in agent behavior came when we changed the reward function itself to include other parameters, such as bumpiness. The problem with changing the reward function is that agents trained with different reward functions could no longer be compared by their final scores, since the function producing that score differed. So, we created a normalized score, based solely on (lines cleared)², that could be used to evaluate agents against each other.
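
Roughly, the two quantities look like this (a sketch; the names are ours and the exact bookkeeping in our scripts may differ):

```python
BOARD_WIDTH = 10  # standard Tetris board width

def training_reward(lines_cleared):
    """Original scoring function, which doubled as the training reward."""
    return (lines_cleared ** 2) * BOARD_WIDTH

def normalized_score(lines_cleared):
    """Reward-independent metric used to compare agents trained with
    different reward functions."""
    return lines_cleared ** 2
```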

Tuning Agent Performance

Our agent’s initial performance after our first training session was an average testing score of 16, meaning 4 lines cleared. After reading nuno-faria’s README file on GitHub and seeing the amazing performance of his agent, which seemingly never loses, we were sorely disappointed with 4 lines cleared. We were also confused about why our agent was performing so differently from nuno-faria’s. So we set off to find ways to improve the agent’s performance by altering the training process.

Our initial instinct in tuning our agent’s performance was to modify the input features (state parameters) and the hyperparameters. After experiencing only marginal improvement, we had more success experimenting with the training scoring (reward) function, changing it to include more heuristics such as the number of holes and maximum bumpiness. The following sections explain our attempts to improve the agent’s performance in depth.

Tuning Hyperparameters

The hyperparameters we played with included the number of training episodes, the epsilon stop episode, and the size of the DQN. By increasing the number of episodes, we hoped the agent would have more games to train on — the equivalent of increasing the size of a dataset when training a linear model — and would therefore recognize state/action patterns increasingly well. According to nuno-faria and his graph shown below, the agent’s score should have spiked around 1400 episodes (training games played) and increased very quickly after that. However, changing episodes from 2000 to 10000 produced negligible results for us.

Nuno-faria’s empirical relationship between number of episodes and agent score.

We also tried varying the epsilon stop episode. We noticed that the agent’s performance would spike around the epsilon stop episode and then, inexplicably, decrease significantly afterwards. However, no choice of epsilon stop episode resulted in better performance.

Adding New Features

Next, we tried adding new state features. Watching Tetris world championship games, we noticed that professional players use various techniques to keep the Tetris board looking “smooth.” This equates to having low bumpiness, low height, and a minimal number of holes. We first added maximum bumpiness, the biggest height difference between any two columns on the board, to the feature set. However, adding this new feature did not improve the score. We also tried removing the number of lines cleared from the feature set. Unfortunately, this did not lead to a higher score either. So, we decided to stick with our original state feature set.
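
As a small sketch, reusing the column_heights helper from the earlier feature snippet:

```python
def max_bumpiness(board):
    """Biggest height difference between any two columns on the board."""
    heights = column_heights(board)
    return max(heights) - min(heights)
```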

A Heavier Death Penalty

When logging values into the q-table, the initial implementation of the agent reduced the reward by 2 if it lost. Since we really wanted to discourage death, or loss of the game, we tried punishing the agent by reducing the reward by 200 on a loss instead. Adding this heavier death penalty improved the score considerably, so we ended up incorporating the technique into our final agent.
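
In code terms the change is small (a sketch; the names are ours):

```python
GAME_OVER_PENALTY = 200  # was effectively 2 in the original implementation

def apply_death_penalty(reward, game_over):
    """Subtract a large penalty when the agent tops out, so losing states
    are strongly discouraged during training."""
    return reward - GAME_OVER_PENALTY if game_over else reward
```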

Improved Reward Function

We noticed that the agent would leave the leftmost and rightmost columns empty even when it received a piece that would fit well there. To encourage a “smoother” pattern, we added a negatively weighted maximum-bumpiness term to the reward. With this tactic, the average score improved from 16 to about 135 and the maximum score improved from 60 to about 316,000, so we kept negative maximum bumpiness in the reward function for our final agent. Still, the agent created a considerable number of holes, so we added a similar negative term for the number of holes in the reward function. However, it did not increase the agent’s score. The final reward function is shown below. Notice that for every round the agent survives, it gains a baseline reward of 1, increased if any lines are cleared, decreased based on bumpiness, and decreased much more if the game ends.

The final agent’s reward function.
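
In rough code form, the shape of that reward is something like the following (the bumpiness weight here is illustrative; the exact coefficients are those shown in the figure above):

```python
BOARD_WIDTH = 10
BUMPINESS_WEIGHT = 0.1   # illustrative weight on maximum bumpiness
DEATH_PENALTY = 200      # the heavier death penalty described earlier

def agent_reward(lines_cleared, max_bumpiness, game_over):
    """+1 baseline per surviving round, a bonus for cleared lines,
    a penalty for bumpiness, and a large penalty if the game ends."""
    reward = 1 + (lines_cleared ** 2) * BOARD_WIDTH - BUMPINESS_WEIGHT * max_bumpiness
    if game_over:
        reward -= DEATH_PENALTY
    return reward
```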

Final Agent Technique and Performance

Our final agent had an average score of 135 and a maximum score of over 300,000. We kept the original four state features: bumpiness, lines cleared, holes, and height. Our final reward function was based on (lines cleared)² and bumpiness.

Improved agent had an average score of 135.

Beatris

Existing “Beatris-like” Implementations

We were inspired to build Beatris after discovering Hatetris, an adversarial Tetris implementation that chooses the next piece to give the player using the following algorithm: test all possible placements of all possible pieces, determine which piece has the worst best-case placement scenario, and spawn that piece. The metric for choosing which scenario is worst is simply the maximum height of the field after the piece is placed.

Existing “Hatetris” game.

However, this strategy is extremely time-consuming and does not use any machine learning models to make its decisions. We wanted to make an even more “evil” AI. The highest score attained against Hatetris is 30 lines cleared. In our implementation, we wanted to make use of more advanced machine learning techniques and further reduce the score a player could achieve against it.

How Beatris Works

Beatris is an adversarial model based on a deep q-learning network, similar to the original Tetris agent. However, for every move the agent makes, Beatris logs the move with a negative reward. In other words, Beatris is punished for many of the same things the agent is rewarded for. Find our implementation here. Below is Beatris’ reward function: for every round the agent survives, Beatris receives a baseline negative reward of 1 and is further punished for any lines cleared by the agent. It is incentivized to increase bumpiness and holes and is most heavily incentivized to end the game.

Beatris’ reward function.
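
In rough code form, Beatris’ reward mirrors the agent’s (the weights here are illustrative, not the exact values we used):

```python
BOARD_WIDTH = 10
BUMPINESS_WEIGHT = 0.1   # illustrative
HOLE_WEIGHT = 0.5        # illustrative
GAME_OVER_REWARD = 200   # ending the game is rewarded most heavily

def beatris_reward(lines_cleared, max_bumpiness, holes, game_over):
    """-1 baseline for every round the agent survives, penalties for lines
    the agent clears, and bonuses for bumpiness, holes, and ending the game."""
    reward = (-1
              - (lines_cleared ** 2) * BOARD_WIDTH
              + BUMPINESS_WEIGHT * max_bumpiness
              + HOLE_WEIGHT * holes)
    if game_over:
        reward += GAME_OVER_REWARD
    return reward
```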

Initial Beatris Performance

When we tested our initial agent, which had an average score of 16, against Beatris, we observed a new average score of 13. What is more interesting is that Beatris was able to stifle the maximum score from 60 down to 19. Against the improved agent, Beatris lowered the average score from 135 to 50. However, we did run into interesting situations when the models interacted, and we wanted to improve Beatris’ performance further. We first attempted to fix the infinite loop problem.

The Infinite Loop Problem

We noticed that the training process always got stuck at a certain point. By rendering the game more often, we were able to pinpoint the cause: the game was stuck in an infinite loop in which Beatris would provide the same piece over and over again, helping the agent clear the same lines. We realized that Beatris was being too greedy in its decision to give the agent a particular piece. By adding the number of holes to Beatris’ reward function, we incentivized Beatris to force more holes onto the Tetris board, hoping that would stop the infinite loop. It worked, and we were able to reduce the agent’s average score to 18.

Beatris provides the same piece infinitely.

Final Performance

While our original goal was to make an adversarial Tetris AI to play against, our AI-vs-AI training process made this project two-pronged: creating the agent and creating Beatris. With regard to the agent, we had hoped to have an almost unbeatable Tetris AI right off the bat. Instead, we had to spend half of our project time training and tuning the agent to an acceptable level of performance, increasing its average score from 16 to 135. With Beatris, once we decided on training a deep q-learning network with a negative reward, it was relatively straightforward to create problems for our agent, especially since we understood the q-learning process and our domain much better after having worked on improving agent performance. When all was said and done, the final agent's average score dropped from 135 to 18 when playing against Beatris, a huge improvement over the earlier impact of 135 to 50. The agent's maximum score was stifled from 316,000 to 45!

Against the improved Beatris, the agent had an average score of 18.

Some things we wished we could have tried if we had more time include:

  • Comparing agent performance against Beatris and against Hatetris
  • Continuing to train the agent until it can play indefinitely
  • Training Beatris against human play
  • Creating a smoother human-playable interface

Source: https://medium.com/@amoghhgoma/beatris-an-evil-tetris-ai-88fee6b068
