Multi-armed Bandits
- evaluate (evaluative feedback) vs. instruct (instructive feedback): relying on the former is the most important feature distinguishing reinforcement learning from other types of learning
- associative vs. nonassociative: the former refers to settings where actions are taken in more than one situation
A $k$-armed Bandit Problem
- You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.
- The value of an arbitrary action $a$ is the expected reward given that $a$ is selected:
$$q_{*}(a) \doteq \mathbb{E}\left[R_{t} \mid A_{t}=a\right]$$
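To make the definition concrete, here is a minimal sketch of a stationary $k$-armed bandit in Python. The class name `KArmedBandit`, the helper `estimate_value`, and the choice of unit-variance Gaussian rewards are all assumptions made for illustration; the notes only require that each action's reward distribution is stationary. The sketch approximates $q_{*}(a)$ by averaging many sampled rewards for a fixed action.

```python
import random


class KArmedBandit:
    """Hypothetical stationary k-armed bandit (Gaussian rewards are an assumption)."""

    def __init__(self, k, seed=0):
        self.rng = random.Random(seed)
        # True action values q_*(a); a learner would not be able to see these.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        # Reward R_t is drawn from a fixed distribution that depends only on a.
        return self.rng.gauss(self.q_star[a], 1.0)


def estimate_value(bandit, a, n_samples=10_000):
    """Approximate q_*(a) = E[R_t | A_t = a] by a sample average of rewards."""
    total = sum(bandit.pull(a) for _ in range(n_samples))
    return total / n_samples


bandit = KArmedBandit(k=5)
estimates = [estimate_value(bandit, a) for a in range(5)]
```

With enough samples per action, each entry of `estimates` should land close to the corresponding entry of `bandit.q_star`. In the actual problem the learner cannot sample every action this freely, since each pull uses up one time step of the period over which total reward is being maximized.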