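A new colleague has joined our group and I need to help him get started with RL, so we chose to begin with Silver's course.
For myself, I am adding one more requirement: a careful read of "Reinforcement Learning: An Introduction".
I did not read it very carefully before, so this time I hope to be more thorough and to write a short summary of the key points as I go.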
K-armed bandit problem:
Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
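In other words, with a fixed number of time steps, find the best action. Each action has its own reward distribution, so trying an action once does not tell you its reward exactly. Finding the best ad placement strategy under a fixed budget, or the best treatment plan under a fixed budget, can both be treated approximately as this kind of problem. From the book: "Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient."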
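To make the setup concrete, here is a minimal sketch of such a problem in Python. The class name GaussianBandit, the function run_random_policy, and the choice of Gaussian rewards with unit variance are my own assumptions for illustration; the problem statement only requires that each action has a stationary reward distribution.

```python
import random


class GaussianBandit:
    """Hypothetical k-armed bandit: arm a pays a reward drawn from N(means[a], 1).

    The Gaussian, unit-variance reward model is an assumption made for this sketch.
    """

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    @property
    def k(self):
        # Number of available actions (arms).
        return len(self.means)

    def pull(self, action):
        # The reward is noisy: a single pull does not reveal the arm's true mean.
        return self.rng.gauss(self.means[action], 1.0)


def run_random_policy(bandit, steps):
    """Spend a fixed budget of time steps choosing arms uniformly at random."""
    total = 0.0
    for _ in range(steps):
        action = bandit.rng.randrange(bandit.k)
        total += bandit.pull(action)
    return total
```

A uniformly random policy like this is only a baseline: it ignores everything it has observed, which is exactly what a learning method should improve on.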
Consider the action-value function:
Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
By the law of large numbers, this sample-average method of estimating the action value is guaranteed to converge. In the book's words: "As the denominator goes to infinity, by the law of large numbers, Qt(a) converges to q∗(a)."
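The sample-average estimate above can be sketched directly from a log of past actions and rewards. The function name sample_average and the default parameter are my own; returning a default value (0 here) for an action that has never been tried is a convention I am assuming, since the ratio is undefined when the denominator is zero.

```python
def sample_average(actions, rewards, k, default=0.0):
    """Estimate Q(a) for each of k actions as the mean reward observed for a.

    actions[i] is the action taken at step i, rewards[i] the reward received.
    Actions that were never selected get `default` (an assumed convention).
    """
    counts = [0] * k    # denominator: how many times each action was chosen
    sums = [0.0] * k    # numerator: total reward collected by each action
    for a, r in zip(actions, rewards):
        counts[a] += 1
        sums[a] += r
    return [sums[a] / counts[a] if counts[a] > 0 else default for a in range(k)]


# Action 0 was taken twice (rewards 1 and 3), action 1 three times
# (rewards 2, 4, 6), and action 2 never.
estimates = sample_average([0, 1, 0, 1, 1], [1.0, 2.0, 3.0, 4.0, 6.0], k=3)
print(estimates)  # [2.0, 4.0, 0.0]
```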
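Exploration and exploitation: pure exploitation usually does not work well, because the current estimates are noisy and the action that looks best so far may not be the best one. Some exploration is needed.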
ε-greedy action selection: