Resources for this book are available here: Reinforcement Learning: An Introduction resources
Chapter 2 Multi-armed Bandits
Definition of the k-armed bandit problem
You are faced repeatedly with a choice among k different options, or
actions. After each choice you receive a numerical reward chosen from
a stationary probability distribution that depends on the action you
selected. Your objective is to maximize the expected total reward over
some time period, for example, over 1000 action selections, or time
steps.
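The setup above can be sketched as a tiny simulator. This is a minimal sketch, not the book's code: it follows the 10-armed testbed convention where each true action value q*(a) is drawn once from a standard normal distribution, and each reward adds unit-variance Gaussian noise. The class and variable names are my own.

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each arm's reward distribution is fixed."""

    def __init__(self, k=10, seed=0):
        rng = random.Random(seed)
        # True action values q*(a), drawn once and then held fixed (stationary).
        self.q_star = [rng.gauss(0, 1) for _ in range(k)]
        self._rng = rng

    def pull(self, action):
        # Reward = true value of the chosen arm plus unit-variance Gaussian noise.
        return self._rng.gauss(self.q_star[action], 1)

bandit = KArmedBandit(k=10)
reward = bandit.pull(3)  # receive a numerical reward for action 3
```

The objective is then to maximize the total reward accumulated over, say, 1000 calls to `pull`.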
Exploitation and exploration
If you maintain estimates of the action values, then at any time step there is at least
one action whose estimated value is greatest. We call these the greedy actions. When you
select one of these actions, we say that you are exploiting your current knowledge of the
values of the actions. If instead you select one of the nongreedy actions, then we say you
are exploring, because this enables you to improve your estimate of the nongreedy action’s
value.
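A common way to balance the two is ε-greedy selection: exploit the greedy action most of the time, but explore uniformly at random with small probability ε. Below is a minimal sketch, assuming sample-average value estimates updated incrementally as Q_{n+1} = Q_n + (1/n)(R_n - Q_n); the function and variable names are my own.

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Select a greedy action with probability 1 - epsilon, else explore."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))  # explore: any action, uniformly
    best = max(Q)
    # Break ties among greedy actions at random.
    return rng.choice([a for a, q in enumerate(Q) if q == best])

Q = [0.0] * 10  # estimated value of each action
N = [0] * 10    # number of times each action has been selected

def update(action, reward):
    """Incremental sample-average update of the action-value estimate."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```

With ε = 0 this reduces to pure exploitation (always a greedy action); a small ε such as 0.1 keeps improving the estimates of the nongreedy actions.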
Notation: