Reinforcement Learning: Exploration and Exploitation

mmc2015

Article link: https://blog.csdn.net/mmc2015/article/details/53466943

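The last two lectures cover bandits and games, respectively. They are included mainly to round out the course; much of the material is fairly involved, so only the key ideas are noted here.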

Lecture 9: Exploration and Exploitation

Online decision-making involves a fundamental choice:
Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions
The problem, however, is:

If an algorithm forever explores it will have linear total regret
If an algorithm never explores it will have linear total regret
Is it possible to achieve sublinear total regret?
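
For reference, total regret in the multi-armed bandit setting is usually defined as below. The symbols Q(a), V^*, \Delta_a and N_t(a) follow standard bandit notation and are not defined in the notes above, so treat them as assumptions of this sketch.

```latex
\begin{aligned}
Q(a) &= \mathbb{E}[r \mid a], \qquad V^* = \max_{a} Q(a), \qquad \Delta_a = V^* - Q(a) \\
L_t &= \mathbb{E}\Big[\sum_{\tau=1}^{t} \big(V^* - Q(a_\tau)\big)\Big]
     = \sum_{a} \mathbb{E}[N_t(a)]\,\Delta_a
\end{aligned}
```

Read this way, regret is linear whenever some suboptimal action keeps being selected a constant fraction of the time, because its expected count \mathbb{E}[N_t(a)] then grows linearly in t.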

Principles of exploration and exploitation:

Naive Exploration:
Add noise to the greedy policy (e.g. ε-greedy) ==> greedy / ε-greedy with a constant ε has linear total regret
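
A minimal sketch of ε-greedy on a Bernoulli bandit may help make the linear-regret claim concrete. The arm means, the function name `run_epsilon_greedy`, and the fixed seed are all illustrative assumptions, not part of the lecture.

```python
import random

def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return cumulative regret after each step."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # number of pulls per arm
    values = [0.0] * n_arms        # running mean reward per arm
    best = max(true_means)
    regret, history = 0.0, []
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore uniformly
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit current estimate
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        regret += best - true_means[arm]                       # expected regret of this choice
        history.append(regret)
    return history

# Hypothetical arm means, chosen only for illustration.
history = run_epsilon_greedy([0.2, 0.5, 0.7], epsilon=0.1, steps=10000)
print(history[4999], history[9999])
```

Because a constant ε keeps sending a fixed fraction of pulls to suboptimal arms, the regret accumulated in the second half of the run stays comparable to that in the first half instead of flattening out.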

Optimistic Initialisation:
Assume the best until proven otherwise ==> greedy / ε-greedy with optimistic initialisation still has linear total regret
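
The effect of optimistic initialisation can be sketched with a purely greedy agent whose value estimates start high. The names `initial_value` and `prior_count` are hypothetical parameters introduced here: the initial estimate is treated as if it came from `prior_count` earlier observations, so the optimism fades gradually instead of after one pull.

```python
import random

def run_optimistic_greedy(true_means, initial_value, prior_count, steps, seed=0):
    """Greedy action selection on a Bernoulli bandit with configurable initial estimates."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    values = [float(initial_value)] * n_arms   # optimistic (or not) starting estimates
    counts = [prior_count] * n_arms            # pseudo-counts backing the initial estimate
    pulls = [0] * n_arms
    for _ in range(steps):
        arm = max(range(n_arms), key=lambda a: values[a])      # always greedy
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        pulls[arm] += 1
    return pulls, values

# Hypothetical arm means, chosen only for illustration.
optimistic_pulls, _ = run_optimistic_greedy([0.2, 0.5, 0.7], 5.0, 5, 1000)
plain_pulls, _ = run_optimistic_greedy([0.2, 0.5, 0.7], 0.0, 0, 1000)
print(optimistic_pulls, plain_pulls)
```

With zero initial values and first-index tie-breaking, the plain greedy agent never leaves the first arm, while the optimistic agent is forced to try every arm. Optimism does not remove the risk of locking onto a suboptimal arm after a few unlucky rewards, which is why total regret can still be linear.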

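Decaying ε-Greedy Algorithm:

Gradually decrease ε, shifting from mostly exploring to mostly choosing the best known action ==> with a suitable decay schedule, the decaying ε-greedy algorithm has logarithmic asymptotic total regret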
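
One commonly quoted decay schedule is ε_t = min(1, c|A| / (d² t)), where d is the smallest positive gap between the best arm and any other arm. The sketch below assumes that schedule; the constant `c`, the arm means, and the function names are illustrative assumptions, and the schedule needs the gap d in advance, which is generally unknown in practice.

```python
import random

def decaying_epsilon(t, n_arms, c, d):
    """Schedule eps_t = min(1, c * |A| / (d^2 * t)) for step t >= 1."""
    return min(1.0, c * n_arms / (d * d * t))

def run_decaying_epsilon_greedy(true_means, c, steps, seed=0):
    """Epsilon-greedy with a decaying epsilon; returns cumulative regret after each step."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    best = max(true_means)
    # Smallest positive gap: assumed known here, which is rarely true in practice.
    d = min(best - m for m in true_means if m < best)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    regret, history = 0.0, []
    for t in range(1, steps + 1):
        if rng.random() < decaying_epsilon(t, n_arms, c, d):
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        regret += best - true_means[arm]
        history.append(regret)
    return history

# Hypothetical arm means and constant, chosen only for illustration.
decay_history = run_decaying_epsilon_greedy([0.2, 0.5, 0.7], c=1.0, steps=10000)
print(decay_history[4999], decay_history[9999])
```

Since ε_t shrinks like 1/t, the expected number of exploratory pulls up to step t grows only like log t, which is where the logarithmic regret comes from.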

Lower bound on regret: asymptotic total regret is at least logarithmic in the number of steps
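
The lower bound is usually stated in the following form (the Lai and Robbins result). Here \Delta_a is the gap between the optimal value and the value of action a, and \mathcal{R}^a is the reward distribution of action a; this notation is assumed, since the notes above do not define it.

```latex
\lim_{t \to \infty} L_t \;\ge\; \log t \sum_{a \,:\, \Delta_a > 0}
    \frac{\Delta_a}{\mathrm{KL}\big(\mathcal{R}^a \,\|\, \mathcal{R}^{a^*}\big)}
```

Arms whose reward distribution resembles that of the best arm (small KL divergence) but whose mean is noticeably lower (large gap) are the hardest to rule out and contribute the most to the bound.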

Optimism in the Face of Uncertainty:
Prefer actions with uncertain values

The more uncertain we are about an action-value, the more important it is to explore that action, since it could turn out to be the best action.

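The reasoning is this: as an uncertain action is tried, its density function gradually becomes more concentrated, and it becomes clear whether its reward is large or small.

After picking the blue action (in the accompanying figure), we are less uncertain about its value and more likely to pick another action, until we home in on the best action.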
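
One standard way to turn this principle into an algorithm is UCB1, which adds an uncertainty bonus sqrt(2 log t / N(a)) to each estimated value. The sketch below uses UCB1 as an illustration of the idea; the helper names and arm means are assumptions, not taken from the notes.

```python
import math
import random

def ucb1_select(values, counts, t):
    """Pick the arm maximising value + sqrt(2 log t / N(a)); untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(values)),
        key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )

def run_ucb1(true_means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit; return the number of pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for t in range(1, steps + 1):
        arm = ucb1_select(values, counts, t)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
    return counts

# Hypothetical arm means, chosen only for illustration.
ucb_pulls = run_ucb1([0.2, 0.5, 0.7], steps=5000)
print(ucb_pulls)
```

The bonus shrinks as an arm is pulled more often, so a frequently chosen arm gradually loses its advantage and other arms get another look, matching the "less uncertain, more likely to pick another action" behaviour described above.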

Probability Matching:
Select actions according to the probability that they are best
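
Thompson sampling is a common way to implement probability matching: sample one plausible value per arm from its posterior and act greedily on the samples. The sketch below assumes Bernoulli rewards with uniform Beta(1, 1) priors; the function name and arm means are illustrative.

```python
import random

def run_thompson(true_means, steps, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1] * n_arms   # 1 + number of observed successes per arm
    beta = [1] * n_arms    # 1 + number of observed failures per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        # One posterior sample per arm; an arm wins with the probability it is best.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if rng.random() < true_means[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
    return pulls, alpha, beta

# Hypothetical arm means, chosen only for illustration.
ts_pulls, ts_alpha, ts_beta = run_thompson([0.2, 0.5, 0.7], steps=5000)
print(ts_pulls)
```

No explicit exploration parameter is needed: arms with wide posteriors occasionally produce large samples and get tried, and that happens less often as their posteriors narrow.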

Information State Search:
Lookahead search incorporating the value of information
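
For a small Bernoulli bandit, the information state can be taken to be the Beta posterior counts of every arm, and lookahead search becomes a finite-horizon dynamic program over those counts. The sketch below is one way to write that under these assumptions; the state encoding and the function name `lookahead_value` are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lookahead_value(state, horizon):
    """Best expected total reward over `horizon` pulls.

    `state` is a tuple of (alpha, beta) pairs, one Beta posterior per arm.
    """
    if horizon == 0:
        return 0.0
    best = 0.0
    for arm, (alpha, beta) in enumerate(state):
        p = alpha / (alpha + beta)   # posterior mean probability of a reward of 1
        # Successor information states after observing a success or a failure.
        win = state[:arm] + ((alpha + 1, beta),) + state[arm + 1:]
        lose = state[:arm] + ((alpha, beta + 1),) + state[arm + 1:]
        value = (p * (1.0 + lookahead_value(win, horizon - 1))
                 + (1.0 - p) * lookahead_value(lose, horizon - 1))
        best = max(best, value)
    return best

# Two arms with uniform priors, an assumption made only for this example.
prior = ((1, 1), (1, 1))
print(lookahead_value(prior, 1), lookahead_value(prior, 2))
```

With uniform priors the one-step value is 0.5, but the two-step value is larger than 2 × 0.5: the first pull also yields information that improves the second choice. The number of information states grows quickly with the horizon and the number of arms, so exact search of this kind is only practical for very small problems.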

Lecture 10: Classic Games

Minimax Search

Self-Play Reinforcement Learning

Combining Reinforcement Learning and Minimax Search

Reinforcement Learning in Imperfect-Information Games
