A Brief Note about Action Exploration Strategies

Thanks to Richard S. Sutton and Andrew G. Barto for their great book, Reinforcement Learning: An Introduction.

Here we discuss some popular action exploration strategies for tabular reinforcement learning systems.

Softmax Exploration Strategy

One method that is often used in combination with RL algorithms is the Boltzmann, or softmax, exploration strategy.
Action selection is still random, but the selection probabilities are weighted by the actions' relative Q-values. This makes it more likely for the agent to choose good actions, while two actions with similar Q-values have almost the same probability of being selected. Its general form is

$$P(a)=\frac{e^{Q(s,a)/T}}{\sum_{i}e^{Q(s,a_i)/T}}$$

in which P(a) is the probability of selecting action a and T is the temperature parameter. Higher values of T move the selection towards a purely random strategy, and lower values move it towards a fully greedy strategy.
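As a minimal sketch of this rule (the NumPy-based helper below, including its name and signature, is my own illustration rather than code from the text), softmax selection over one row of a tabular Q-function might look like:

```python
import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    # Subtracting the max before exponentiating avoids overflow;
    # it does not change the resulting probabilities.
    prefs = (np.asarray(q_values) - np.max(q_values)) / temperature
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(q_values), p=probs)
```

With a large temperature the probabilities flatten towards uniform random selection; as the temperature approaches zero the distribution concentrates on the greedy action.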

Upper-Confidence-Bound Action Selection

It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions according to

$$A_t=\underset{a}{\operatorname{argmax}}\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\,\right]$$
where $N_t(a)$ denotes the number of times that action a has been selected prior to time t, and $c>0$ controls the degree of exploration.

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a's value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action a, with c determining the confidence level. The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
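A minimal sketch of this rule, assuming a bandit-style setting where q_values holds the current estimates Q_t(a) and counts holds N_t(a) (the function name and the zero-count handling are my own illustration, not from the text):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    # An action that has never been tried has unbounded uncertainty,
    # so it is treated as a maximizing action and selected first.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))
```

Note how the bonus term shrinks for frequently selected actions (large N_t(a)) and grows slowly with t for neglected ones, which is exactly the decreasing-frequency behavior described above.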
