Summary of "Multi-arm Bandits", Chapter 2 of Reinforcement Learning: An Introduction


A new colleague has joined our group and I need to help him get started with RL, so we are beginning with Silver's course.

For myself, I am adding the requirement of carefully reading Reinforcement Learning: An Introduction.

My previous reading was not very careful, so this time I hope to be more thorough and also write brief summaries of the corresponding topics.




K-armed bandit problem:

    Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. 

    In other words, within a fixed number of timesteps, find the optimal action (each action has its own reward distribution, so selecting an action once does not reveal its reward exactly). Finding the best ad-placement strategy under a fixed budget, or finding the best treatment plan under a fixed budget, can both be roughly viewed as problems of this kind. Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient.
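
A minimal sketch of such a stationary k-armed bandit environment (the class name, the Gaussian reward model, and the use of NumPy are my own assumptions, loosely following the book's 10-armed testbed):

```python
import numpy as np

class KArmedBandit:
    """Stationary k-armed bandit: each arm has a fixed reward distribution."""

    def __init__(self, k=10, seed=None):
        self.k = k
        self.rng = np.random.default_rng(seed)
        # True action values q*(a); the learner never observes these directly.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def step(self, action):
        # Reward depends only on the chosen action, drawn around q*(action).
        return self.rng.normal(self.q_star[action], 1.0)
```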



Consider the action-value function:

    Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}(A_i = a)}{\sum_{i=1}^{t-1} \mathbb{1}(A_i = a)}

    By the law of large numbers, this sample-average method of computing Q(a) is guaranteed to converge: As the denominator goes to infinity, by the law of large numbers, Q_t(a) converges to q*(a).
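
A small sketch of the sample-average estimate (function and variable names are my own; the incremental update used here is algebraically equivalent to the ratio above):

```python
import numpy as np

def sample_average_q(actions, rewards, k):
    """Sample-average estimate: for each action a, the mean of the rewards
    received on the timesteps when a was selected."""
    Q = np.zeros(k)  # current estimates Q(a)
    N = np.zeros(k)  # number of times each action has been selected
    for a, r in zip(actions, rewards):
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental mean, same result as the ratio
    return Q
```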



exploration and exploitation: pure exploitation is generally not good; some exploration is needed

    ε-greedy action selection: with probability ε select a random action, otherwise select the greedy (highest-estimate) action.
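
A possible sketch of ε-greedy selection on top of the estimates Q (the function name and the random tie-breaking are my own choices):

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1, rng=None):
    """With probability epsilon explore a uniformly random action;
    otherwise exploit an action with the highest current estimate."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))                  # explore
    return int(rng.choice(np.flatnonzero(Q == Q.max())))  # exploit (ties broken randomly)
```

With ε = 0 this reduces to the purely greedy, exploitation-only rule; a small positive ε keeps every action being sampled, so the sample-average estimates of all actions continue to improve.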
