
Difficulties Encountered in Machine Learning Projects

Explore-Exploit Dilemma


Decision and dilemma are two sides of the same coin. Imagine a student looking forward to learning data science. He searches online for data science courses and gets back a number of courses from Harvard, MIT, Coursera, Udemy, Udacity, etc. Now, here is the dilemma: given all of these courses and all of this information, how does he figure out at the initial stage which course is best for him? Deciding on the best course after going through all of the course outlines one by one might be the ideal solution for him. In reinforcement learning, this is exploration, where one gathers information to assess the scenario, which may lead to a better decision in the future. After the exploration, he may decide to learn from a specific course. This is called exploitation in reinforcement learning, where one takes the optimal decision with the highest possible outcome given the currently acquired knowledge or information.


Exploration is a necessary step, though it is a labor-intensive and time-consuming process. Consequently, it raises questions: how long should we explore? When should we start exploiting? How much should we exploit? Investigating these questions gives us what we actually seek: identifying the best option to exploit while, at the same time, exploring a sufficient number of options. This is the explore-exploit dilemma in reinforcement learning.


The Multi-Armed Bandit Problem (MABP) is a classic example of the explore-exploit dilemma, where the goal is to investigate the best option among a set of options and later exploit the selected option. In the MABP there are multiple slot machines (bandits), and each of the bandits has a different win rate. Learning the win rate of each machine is a challenge, since playing multiple machines costs money and time. We only want to play, or exploit, the machine with the highest possible outcome, yet at the same time we should explore all the machines to find the one with the best win rate. Hence the explore-exploit dilemma arises, and this formulation is known as the MABP.

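To make the setup concrete, here is a minimal sketch (not part of the original article) that models each machine as a Bernoulli reward source with a hidden win rate; the specific win rates below are made-up values for illustration.

```python
import random

class BernoulliBandit:
    """A single slot machine whose true win rate is hidden from the player."""
    def __init__(self, win_rate):
        self.win_rate = win_rate  # hidden probability of a payout

    def pull(self):
        # Return 1 (win) with probability win_rate, otherwise 0 (loss).
        return 1 if random.random() < self.win_rate else 0

# Hypothetical machines with different, unknown win rates (illustrative values).
machines = [BernoulliBandit(p) for p in (0.2, 0.5, 0.7)]
```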

The explore-exploit dilemma exists almost everywhere, from business to our day-to-day activities. Which movie or TV series on Netflix or Amazon we should watch is a real-life example of the dilemma. When I open Netflix or Amazon Prime, I generally start exploring different movies and TV shows to find an enjoyable one to watch, or exploit (unfortunately, I often lose interest in watching after a long period of exploring!). Another famous example is whether we should go to our favorite restaurant or try a new one.


How is the explore-exploit dilemma addressed?


Now that we know the dilemma, the next step is addressing it in actual reinforcement learning algorithms. There are multiple strategies that address the dilemma in order to find the optimal solution with the highest possible outcome:


Epsilon-Greedy Method


Epsilon-Greedy (ε-greedy) is the most common and simplest algorithm for balancing the trade-off between exploring and exploiting by choosing between them randomly. Assume that we explore n options at the initial stage. Then, under the ε-greedy algorithm, (1 − ε) percent of the time we greedily exploit the best option k among the n options, and for the remaining ε percent of the time the other options are explored at random, in search of a better decision than the previously best option k. The value of ε is typically set to 10% (ε = 0.1).

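A minimal sketch of ε-greedy selection, reusing the hypothetical machines from the earlier sketch; the horizon of 1,000 plays and ε = 0.1 are illustrative choices, not values from the original article.

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Explore a random arm with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))                    # explore a random arm
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit the best arm

estimates = [0.0] * len(machines)   # running average reward for each machine
counts = [0] * len(machines)        # number of times each machine was played

for _ in range(1000):
    arm = epsilon_greedy(estimates, epsilon=0.1)
    reward = machines[arm].pull()
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update
```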

Epsilon Decreasing Method


Epsilon decreasing is similar to the ε-greedy method. In the ε-greedy method the value of ε remains fixed, while in the epsilon-decreasing method the ε value gradually decreases over time. The amount of exploration of new options gradually decreases as ε shrinks, meaning that the best option becomes more certain as the process goes on.

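A sketch of one possible decay schedule, again reusing epsilon_greedy, machines, estimates, and counts from the sketches above; the exponential decay and its particular constants are assumed, illustrative choices rather than a prescribed schedule.

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Epsilon shrinks toward eps_min as the number of plays t grows."""
    return max(eps_min, eps_start * (decay ** t))

for t in range(1000):
    arm = epsilon_greedy(estimates, epsilon=decayed_epsilon(t))  # less exploration over time
    reward = machines[arm].pull()
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```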

Translated from: https://towardsdatascience.com/intro-to-reinforcement-learning-the-explore-exploit-dilemma-463ceb004989
