Debug Your Life Like a DRL Agent

This is an intersection I found myself at after reading "The Subtle Art of Not Giving a F***" and "Reinforcement Learning".

Suffer.

Why do I always suffer? What's the point? And why always me? What's wrong with my life?

I'm Arab. These two words have given me more disadvantages than you could possibly imagine: not being Arab in itself, but being born in a neglected third-world country. Everyone judges me for it, though I didn't choose it, and though I'm already facing enough problems of my own: no education, no income (valued in American dollars), no nothing.

And I started questioning why my life is so f*ed up. It made me really sad. I don't want to say depressed, because depression is more serious than that, and I don't want to abuse the word.

After some reading, I came to the realization that I'm not the only one who suffers. Everyone suffers, and I'll keep suffering my whole life whether I like it or not. I can bitch and whine about it, or I can take responsibility and figure out a way of suffering better, of suffering beautifully, as Lex Fridman says.

Imagine you're training a reinforcement learning agent, and it's receiving only negative rewards, then it crashes, then more negative rewards every epoch. You've trained it for hours and it's not improving. You realize something is wrong and start debugging your agent. What could be wrong?

1- Your value function is misleading or superficial:

Imagine your agent is a racing car and its value function is "be faster than the other cars on the track." At first glance, this seems like a good value function: if you're faster than everyone else, you're going to finish first! Easy!

With this value function, your agent will rarely make it halfway around the track. No matter how much you train it, no matter how much it suffers, it will be a failure.

This is you when you think "I'll be happy when I'm richer than everyone else" or "when I have more knowledge than everyone else" (the latter was mine for a long time).

It's just stupid because it depends on external events; you can't control other people's speed. A better one would be "go as fast as possible at every point of the race." On top of that, speed isn't the only factor, despite this being an actual speed race: it doesn't matter that you're going 300 km/h if you end up shooting off the track. So an even better one is "go as fast as possible at every point of the race while maintaining balance and accuracy."
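
To make that concrete, here is a minimal sketch in Python of the two kinds of reward signal. Everything in it is made up for illustration (the CarState fields and the numbers are hypothetical, not from any real racing environment): the first function depends on what rivals do, the second only on things the agent itself controls.

```python
from dataclasses import dataclass

@dataclass
class CarState:
    """Hypothetical snapshot of our racing agent; purely illustrative."""
    speed: float           # our car's speed (km/h)
    fastest_rival: float   # fastest rival's speed (km/h), outside our control
    on_track: bool         # did we stay on the track this step?
    steering_error: float  # 0.0 = perfect racing line, 1.0 = way off

def misleading_reward(s: CarState) -> float:
    """'Be faster than everyone else': sparse, and hostage to external events."""
    return 1.0 if s.speed > s.fastest_rival else -1.0

def better_reward(s: CarState) -> float:
    """'Go as fast as possible at every point while maintaining balance and
    accuracy': every term is something the agent can actually influence."""
    if not s.on_track:
        return -10.0                    # shooting off the track is always bad
    speed_term = s.speed / 300.0        # dense reward for our own speed
    accuracy_term = 1.0 - s.steering_error
    return speed_term + accuracy_term

state = CarState(speed=220.0, fastest_rival=260.0, on_track=True, steering_error=0.2)
print(misleading_reward(state))  # -1.0: "failure", with nothing to learn from
print(better_reward(state))      # ~1.53: still behind, but a useful signal
```

The exact numbers don't matter; the point is that the second signal rewards progress at every step instead of only rewarding beating everyone else.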

2- Your learning algorithm is not correct:

When you study any DRL algorithm, you also study a proof that, after some training (or suffering, in human language), it will achieve its goal of maximizing the reward function. There are no proofs in real life that you're learning, but unlike the agent, you're smart enough to figure out whether you're learning or not. If you're not learning from your suffering, you're not benefiting from your suffering; you just take the bad half and leave out the good one. You're going to suffer anyway, so learn from it.

But how does that map to our lives as human beings?

Correct RL algorithms are built around the idea of making the choice that maximizes the total reward, and so should you. Wake up one day, look in the mirror, and ask yourself: will what I'm about to do today maximize my reward function? Will I receive a positive reward for it? And by reward, I don't mean money or fame; I mean a human reward. Luckily, human beings come with a pre-installed reward system, and if you listen to it, if the algorithm you wake up every day to execute is to maximize this total reward, that's when you suffer beautifully.

And when I say total reward, I mean total reward. Part of the challenge of DRL is the inability to predict the future, but you can.

Drinking or drugs will greatly boost your reward system, but are they going to maximize your total reward? How much negative reward are you going to receive if you end up addicted? This is called a "greedy" algorithm: go for the short-term big reward and depend on luck after that. Maybe you'll end up an alcoholic, maybe not. Most of the time you will, and greediness is proven not to be a correct algorithm.
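
As a rough sketch of why the greedy choice loses, here is a toy calculation of the discounted total reward for a "big reward now, pay later" choice versus a modest but sustainable one. The reward numbers are invented purely for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Total reward the way RL counts it: sum of reward_t * gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Invented numbers, just to show the shape of the trade-off.
greedy_choice = [10.0] + [-1.0] * 100   # big boost today, a small negative reward every day after
patient_choice = [0.5] * 101            # modest positive reward, day after day

print(round(discounted_return(greedy_choice), 1))   # about -52.8
print(round(discounted_return(patient_choice), 1))  # about 31.9
```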

3- Maybe the environment is too complicated?

If you're sure the value function is acceptable and your algorithm is correct and the agent is still not reaching the goal, then what's wrong? Not all agents take the same training time, and not all humans suffer equally. Some agents train on a track that's just a straight line; others train on a track full of sharp turns and climbs and drops. Agents on the straight line train less, but they end up less intelligent: test them on a slightly more complicated track and they f* up really badly. However, it's better for the more intelligent agents to start on a simple track and then move gradually to more complicated ones.
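
This idea of starting simple and graduating to harder tracks is usually called curriculum learning. Below is a tiny, self-contained toy of it; the "skill" and "difficulty" numbers are invented stand-ins for real training, not any actual RL library.

```python
import random

# Toy curriculum: tracks get harder, and the agent only moves on
# once its skill is good enough for the current one.
curriculum = [("straight_line", 0.2), ("gentle_curves", 0.5), ("sharp_turns_and_hills", 0.9)]

def train_on_track(skill, difficulty, max_epochs=200):
    """Crude stand-in for training: skill grows a little every epoch,
    and we stop once it exceeds the track's difficulty."""
    for epoch in range(max_epochs):
        skill += 0.01 * random.random()   # suffering slowly turning into skill
        if skill > difficulty:
            break
    return skill

skill = 0.0
for name, difficulty in curriculum:
    skill = train_on_track(skill, difficulty)
    print(f"finished {name} with skill {skill:.2f}")
```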

So if you're choosing good values to follow, and you're actually following them, and your reward system is still giving you negative rewards, give yourself some time: the more complicated your track, the more suffering it takes. Maybe move yourself to a less complicated track if you have the choice. You'll receive less valuable rewards than on the complicated track, but at least you'll begin getting some positive rewards that teach you what to do; fewer choices mean a higher probability of picking the right one, and when you move back to the complicated track, you'll have knowledge to base your choices on. If you don't have that chance, you should believe that you were given that track for a reason. "You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life." - Steve Jobs. Yes, you're in a harder environment than other agents, but this is a chance: a chance to suffer more, learn more, and become a better, smarter agent than those in a simpler environment.

It's funny how similar this debugging process is to human life. There's a saying that DRL implementations fail silently: the agent won't tell you "what is this stupid value function you gave me" or "there's a bug in your implementation." It will just train with what you gave it, and you won't even know whether something is wrong or it's just taking time to learn about the environment. Much like our lives: you think your values are great until you're dying in your bed thinking "Oh God! What a waste my life was. I wish I had done better."

But what is the metric here? We create an agent for a purpose, and we say this agent is better than that agent if it can achieve that purpose in a smarter, faster, or cheaper way. What about humans?

That's where human knowledge branches the most: at figuring out "what is the meaning of life?" That will be the subject of my next story :)

Peace out ✌

Translated from: https://medium.com/@abogaziah/debug-your-life-like-a-drl-agent-cf34b0a6d6d
