Reinforcement Learning: An Introduction - Richard S. Sutton Part 1: Tabular Solution Methods

For resources on this book, see: Reinforcement Learning: An Introduction resources

Chapter 2 Multi-armed Bandits

Definition of k-armed bandit problem

You are faced repeatedly with a choice among k different options, or
actions. After each choice you receive a numerical reward chosen from
a stationary probability distribution that depends on the action you
selected. Your objective is to maximize the expected total reward over
some time period, for example, over 1000 action selections, or time
steps.
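
As a rough illustration of the definition above, here is a minimal Python sketch of a stationary k-armed bandit. The class name `KArmedBandit`, the helper `run_random_policy`, and the choice of Gaussian rewards are assumptions made for this example; the definition itself only requires that each action has a fixed reward distribution.

```python
import random


class KArmedBandit:
    """A stationary k-armed bandit: each action has a fixed reward distribution."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # Assumption: true action values are drawn once from N(0, 1)
        # and never change afterwards (this is what "stationary" means here).
        self.true_values = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        # Assumption: the reward is the action's true value plus
        # unit-variance Gaussian noise.
        return self.rng.gauss(self.true_values[action], 1.0)


def run_random_policy(bandit, steps=1000):
    """Total reward collected by choosing actions uniformly at random."""
    total = 0.0
    for _ in range(steps):
        action = bandit.rng.randrange(bandit.k)
        total += bandit.pull(action)
    return total
```

A uniformly random policy like this one ignores everything it has observed, so it serves only as a baseline; the objective in the definition is to collect more total reward than this over the same number of time steps.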

Exploit and explore

If you maintain estimates of the action values, then at any time step there is at least
one action whose estimated value is greatest. We call these the greedy actions. When you
select one of these actions, we say that you are exploiting your current knowledge of the
values of the actions. If instead you select one of the nongreedy actions, then we say you
are exploring, because this enables you to improve your estimate of the nongreedy action’s
value.
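
The distinction between exploiting and exploring can be sketched in a few lines of Python. The function names below are hypothetical, and the `choose_action` rule (act randomly with a small probability `epsilon`, otherwise act greedily) is one common way to mix the two behaviours, commonly called epsilon-greedy; it is added here as an illustration and is not stated in the passage above.

```python
import random


def greedy_actions(estimates):
    """Return every action whose estimated value is maximal (ties included)."""
    best = max(estimates)
    return [a for a, q in enumerate(estimates) if q == best]


def is_exploiting(action, estimates):
    """True if the action is one of the current greedy actions."""
    return action in greedy_actions(estimates)


def choose_action(estimates, epsilon, rng):
    """Assumed epsilon-greedy rule: with probability epsilon pick an action
    uniformly at random, otherwise pick one of the greedy actions."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return rng.choice(greedy_actions(estimates))
```

With `epsilon = 0` the rule always exploits; any larger value occasionally selects a nongreedy action, which is what allows its estimate to improve.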

Notation:
