
Applying Reinforcement Learning to Combinatorial Optimization Problems

by Sterling Osborne, PhD Researcher


How to apply Reinforcement Learning to real life planning problems

Recently, I have published some examples where I created Reinforcement Learning models for real-life problems, for example, using Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences.


Reinforcement Learning can be used in this way for a variety of planning problems, including travel plans, budget planning and business strategy. Two advantages of using RL are that it takes into account the probability of outcomes and that it allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.


What is Reinforcement Learning?

Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment, essentially by trial and error. The model starts with a random policy, and each time an action is taken, an amount of feedback (known as a reward) is fed back to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.


As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore it finds the best actions in any given state, known as the optimal policy.
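To make this loop concrete, here is a minimal sketch of the trial-and-error process in Python. The tiny environment, the action names and the probabilities are all invented purely for illustration; the point is only the structure of episodes, rewards and a random starting policy.

```python
import random

def step(state, action):
    """A made-up environment: each action either wins, loses or continues."""
    outcome = random.random()
    if outcome < 0.4:
        return "end", 1, True      # positive outcome, episode ends
    elif outcome < 0.7:
        return "end", -1, True     # negative outcome, episode ends
    return "start", 0, False       # nothing decided yet, episode continues

def random_policy(state):
    """The random starting policy: ignore the state and pick any action."""
    return random.choice(["a", "b"])

def run_episode():
    """Play one episode with the random policy and return the total reward."""
    state, total, done = "start", 0, False
    while not done:
        action = random_policy(state)
        state, reward, done = step(state, action)
        total += reward
    return total

# Averaging over many episodes shows what the random policy tends to achieve.
print(sum(run_episode() for _ in range(1000)) / 1000)
```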


Many of the RL applications found online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes the success and failure of trying different moves.


In real life, it is likely we do not have access to train our model in this way. For example, a recommendation system in online shopping needs a person’s feedback to tell us whether it has succeeded or not, and this is limited in its availability based on how many users interact with the shopping site.


Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observed Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.
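As a rough sketch of what that estimation step might look like, the snippet below counts observed (state, action, next state) transitions and normalises them into probabilities. The shopping states and the observations themselves are made up for illustration.

```python
from collections import defaultdict

# Hypothetical observed transitions: (state, action, next_state).
observations = [
    ("browsing", "recommend", "purchase"),
    ("browsing", "recommend", "leave"),
    ("browsing", "recommend", "purchase"),
    ("purchase", "recommend", "leave"),
]

counts = defaultdict(lambda: defaultdict(int))
for state, action, next_state in observations:
    counts[(state, action)][next_state] += 1

# Normalise counts into estimated probabilities P(next state | state, action).
probabilities = {
    key: {s: n / sum(nexts.values()) for s, n in nexts.items()}
    for key, nexts in counts.items()
}

print(probabilities[("browsing", "recommend")])
# e.g. {'purchase': 0.67, 'leave': 0.33}
```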


Partially Observed Markov Decision Processes (POMDPs)

Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.
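The snippet below is a small illustration of the Markov Property: the next state is sampled using only the current state, never the history of earlier states. The two weather states and their probabilities are invented for this example.

```python
import random

# Invented transition probabilities P(next state | current state).
transition_probs = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the next state from the current state alone."""
    states = list(transition_probs[current].keys())
    weights = list(transition_probs[current].values())
    return random.choices(states, weights=weights)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)   # the past trajectory is never consulted
    print(state)
```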


POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment but is instead given a set probability distribution based on what we have observed. More info can be found here. We could use value iteration methods on our POMDP, but instead I have decided to use Monte Carlo Learning in this example.
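Before the full example, here is a minimal sketch of the Monte Carlo idea: the value of a state is estimated by averaging the discounted returns observed after visiting it in sampled episodes. The episodes and the discount factor below are made up for illustration.

```python
from collections import defaultdict

# Invented episodes, each a list of (state, reward received) pairs.
episodes = [
    [("A", 0), ("B", 0), ("C", 1)],
    [("A", 0), ("C", -1)],
    [("B", 0), ("C", 1)],
]

gamma = 0.9                      # discount factor (assumed)
returns = defaultdict(list)

for episode in episodes:
    g = 0
    # Work backwards so g is the discounted return from each state onwards.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns[state].append(g)

# The value estimate for each state is the average of its observed returns.
values = {s: sum(rs) / len(rs) for s, rs in returns.items()}
print(values)
```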


Example Environment

Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste and requires that any pieces of scrap paper must be passed to him at the front of the classroom, where he will place the waste into the bin (trash can).


However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this
