Evolution Strategies for Reinforcement Learning

In the last article, the goal we set for ourselves was to optimize Deep Q-Learning with prioritized experience replay, in other words, to give the algorithm a bit of help in judging what is important and should be remembered and what is not. More generally, given the current state of the technology, algorithms tend to perform better when helped by human intervention. Take the example of image recognition, and let's say you want to classify apples and bananas. Your algorithm would certainly be more accurate with the prior knowledge that bananas are yellow than if it had to learn that by itself. This help can also take the form of over-engineering a set of hyper-parameters that only optimizes one very specific task. In reinforcement learning, a way to show that an algorithm generalizes is to test it on multiple environments. This is exactly why the OpenAI environments were made: they give researchers a simple, uniform interface that makes it very easy to switch between environments when testing their algorithms.

Now, about prioritized experience replay: the original publication showed that it generalizes well across most environments, yet this little bit of human intervention did not help in our case. After all, maybe this environment simply needs more randomness to be solved. In Deep Q-Learning, randomness is introduced through an epsilon-greedy policy, which amounts to saying "how about we occasionally skip the most optimized action and see what happens". This is also called exploration. But this is actually a very pale form of randomness, since it is still based on a stochastic (probabilistic) policy. In that case, how about making the behavior totally random?
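For reference, epsilon-greedy action selection boils down to a coin flip before every action. Here is a minimal sketch, assuming the Q-values for the current state are already available as a 1-D array; the function name and signature are illustrative, not code from the original project.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """Return a random action with probability epsilon, otherwise the greedy one.

    q_values: 1-D array of estimated action values for the current state
    (an illustrative placeholder, not the article's actual network output).
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        # Exploration: ignore the value estimates and act randomly.
        return int(rng.integers(len(q_values)))
    # Exploitation: take the action the network currently thinks is best.
    return int(np.argmax(q_values))
```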

Reinforcement learning randomness cooking recipe:

  • Step 1: Take a neural network with a set of weights, which we use to transform an input state into a corresponding action. By taking successive actions guided by this neural network, we collect and add up each successive reward until the experience is complete.
  • Step 2: Now add the randomness: from this set of weights, generate another set of weights by adding random noise to the original weight parameters, that is, modify them a bit using a sampling distribution, for example a Gaussian distribution. Sample a new experience and collect the total reward.
  • Step 3: Repeat the random sampling of weight parameters until you achieve the desired score (a minimal sketch of this recipe follows the list).
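Put together, this recipe is essentially random search over the weights. Below is a minimal sketch under a few assumptions: a Gym-style environment with the classic reset()/step() API, and a single linear layer standing in for the neural network. The names run_episode and random_search, and the sigma/target parameters, are illustrative, not code from the article.

```python
import numpy as np

def run_episode(env, weights):
    """Step 1: act greedily with the given weights and add up the rewards."""
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = int(np.argmax(state @ weights))   # state -> action through the "network"
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

def random_search(env, state_dim, n_actions, sigma=0.5, target=195.0, max_iters=1000):
    """Steps 2 and 3: perturb the initial weights with Gaussian noise until the target score is hit."""
    weights = np.random.randn(state_dim, n_actions)                      # the initial set of weights
    for _ in range(max_iters):
        candidate = weights + sigma * np.random.randn(*weights.shape)    # Step 2: add Gaussian noise
        if run_episode(env, candidate) >= target:                        # Step 3: repeat until the desired score
            return candidate
    return weights
```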

This is the most random thing you could do: pull out a neural network with random weights and see if it works; if not, try again. The truth is, this is very unlikely to work, or at least to work in a reasonable amount of time. Yet some very promising papers published in recent years, which achieve very competitive results on tasks such as teaching humanoid robots how to walk, are not so far from applying this very basic cooking recipe.

Let's get to it. Now imagine that instead of always sampling around the same initial set of weights, at each sampling iteration you compare your reward with the reward obtained by the previous set of weights. If the reward is better, it means your neural network has a better idea of what the optimal policy is, so you can start from there to sample the next set of weights. This process is called hill climbing.

The analogy is pretty straightforward: you are trying to optimize your total reward, which lies at the summit of the hill, and you are taking successive steps. If a step brings you closer to the top, you confidently start from there for the next step. Otherwise, you come back to your previous position and try another direction. It actually looks very much like gradient ascent, where you optimize a function by "climbing" it. The difference lies in how the neural network is updated: in hill climbing, you do not backpropagate to update the weights, you only use random sampling.
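Compared with the pure random search above, only the update rule changes: keep a candidate only if it improves on the best reward seen so far, and sample around that. A minimal sketch, reusing the illustrative run_episode() helper from the previous snippet:

```python
import numpy as np

def hill_climbing(env, state_dim, n_actions, sigma=0.5, target=195.0, max_iters=1000):
    """Random-perturbation hill climbing: no backpropagation, only sampling."""
    best_weights = np.random.randn(state_dim, n_actions)
    best_reward = run_episode(env, best_weights)        # helper from the random-search sketch
    for _ in range(max_iters):
        candidate = best_weights + sigma * np.random.randn(*best_weights.shape)
        reward = run_episode(env, candidate)
        if reward > best_reward:
            # A step closer to the summit: restart the search from here.
            best_weights, best_reward = candidate, reward
        if best_reward >= target:
            break
    return best_weights, best_reward
```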

Hill climbing actually belongs to a group of algorithms called black box optimization algorithms.

We don't know exactly what function we are trying to optimize; we can only feed it an input and observe the result. Based on the output, we can modify our input to try to reach the optimal value. Actually, that sounds very familiar! Indeed, reinforcement learning algorithms also rely on a black box, since they are built on an environment that provides the agent with rewards.
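In the black-box view, all the optimizer ever sees is a mapping from a weight vector to a score. A minimal illustrative wrapper around the run_episode() helper above could look like this (make_black_box_objective is a hypothetical name, not from the original code):

```python
def make_black_box_objective(env, state_dim, n_actions):
    """Expose the environment as a black box: weights in, total reward out."""
    def objective(flat_weights):
        # The optimizer never sees the environment's dynamics, only this scalar score.
        weights = flat_weights.reshape(state_dim, n_actions)
        return run_episode(env, weights)
    return objective
```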
