A Brief Note about Action Exploration Strategies

Thanks to Richard S. Sutton and Andrew G. Barto for their great book, Reinforcement Learning: An Introduction.

Here we discuss some popular action exploration strategies for tabular reinforcement learning systems.

Softmax Exploration Strategy

One method that is often used in combination with RL algorithms is the Boltzmann, or softmax, exploration strategy.
Action selection is still random, but the selection probabilities are weighted by the actions' relative Q-values. This makes it more likely for the agent to choose good actions, while two actions with similar Q-values have almost the same probability of being selected. Its general form is

$$P(a)=\frac{e^{Q(s,a)/T}}{\sum_{i}e^{Q(s,a_i)/T}}$$

in which P(a) is the probability of selecting action a and T is the temperature parameter. Higher values of T move the selection towards a purely random strategy, and lower values move it towards a fully greedy strategy.
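As a minimal sketch of this rule (the NumPy-based helper below, including its name and signature, is my own illustration rather than code from the text), softmax selection over one row of a tabular Q-function might look like:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    # Subtracting the max before exponentiating avoids overflow;
    # it does not change the resulting probabilities.
    prefs = (np.asarray(q_values) - np.max(q_values)) / temperature
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(q_values), p=probs)
```

With a large temperature the probabilities flatten towards uniform random selection; as the temperature approaches zero the distribution concentrates on the greedy action.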

Upper-Confidence-Bound Action Selection

It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions according to

$$A_t=\underset{a}{\operatorname{argmax}}\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\,\right]$$
where $N_t(a)$ denotes the number of times that action a has been selected prior to time t, and $c>0$ controls the degree of exploration.

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a's value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action a, with c determining the confidence level. The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
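A minimal sketch of this rule, assuming a bandit-style setting where q_values holds the current estimates Q_t(a) and counts holds N_t(a) (the function name and the zero-count handling are my own illustration, not from the text):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    # An action that has never been tried has unbounded uncertainty,
    # so it is treated as a maximizing action and selected first.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))
```

Note how the bonus term shrinks for frequently selected actions (large N_t(a)) and grows slowly with t for neglected ones, which is exactly the decreasing-frequency behavior described above.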
