Q-learning, Expected Sarsa, and Double Learning

A brief introduction to Q-learning

Q-learning is defined as

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$

Here, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed, and that is why Q-learning is off-policy.

All that is required for correct convergence is that all state-action pairs continue to be updated.

So we do not need to worry about which behavior policy selects the state-action pairs that get visited and updated: as long as all pairs keep being updated (plus the usual stochastic-approximation conditions on the step sizes), $Q$ converges with probability 1 to $q_*$. The complete update process can be shown as:

[Figure: pseudocode for Q-learning (off-policy TD control)]
From this process we can also draw the backup diagram. Note that Q-learning maximizes over all possible actions in the next state, and we use an arc across the action nodes to indicate this maximization.

[Figure: backup diagram of Q-learning]
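To make the update concrete, here is a minimal sketch of one tabular Q-learning step in Python with NumPy. The table layout, step size, and discount below are illustrative assumptions, not part of the original post:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: bootstrap from max_a' Q(s', a').

    Q is assumed to be a NumPy array of shape (n_states, n_actions);
    alpha and gamma are illustrative defaults.
    """
    td_target = r + gamma * np.max(Q[s_next])   # greedy (off-policy) target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q
```

The behavior policy (for example $\varepsilon$-greedy) only decides which pairs $(s, a)$ get visited; the bracketed target always bootstraps from the greedy action, which is what makes the method off-policy.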

Expected Sarsa: an improved algorithm

Q-learning maximizes over all possible actions in the next state; what if we use the expected value under the target policy instead? Following this idea, we can change the Q-learning update into:

$$
\begin{aligned}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \, E_\pi[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \big] \\
&\leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
\end{aligned}
$$

Given the next state, $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called Expected Sarsa.

Because Expected Sarsa computes an expectation over all actions in the next state, it eliminates the variance due to the random selection of $A_{t+1}$, but it increases the computational cost of each step.
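As a sketch of how that expectation can be computed for an $\varepsilon$-greedy target policy $\pi$ (the value of $\varepsilon$ and the table shape below are illustrative assumptions):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One tabular Expected Sarsa step with an epsilon-greedy target policy."""
    n_actions = Q.shape[1]
    # pi(a' | s_next): epsilon-greedy probabilities derived from Q
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    expected_q = probs @ Q[s_next]          # E_pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}]
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```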

In tasks with deterministic state transitions (e.g., Cliff Walking), Expected Sarsa performs better than Sarsa, because it remains effective over a much wider range of values of the step-size parameter $\alpha$, even $\alpha = 1$, whereas Sarsa only performs well in the long run at small values of $\alpha$. The figure below shows this clearly:

[Figure: performance of Sarsa and Expected Sarsa on the cliff-walking task as a function of the step-size $\alpha$]
It is worth noting that Expected Sarsa can also be an off-policy algorithm: for instance, suppose $\pi$ is the greedy policy while the behavior policy is more exploratory; then Expected Sarsa is exactly Q-learning.

In this sense, Expected Sarsa improves on Sarsa and at the same time generalizes Q-learning.
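A quick check of that reduction: with a greedy target policy ($\varepsilon = 0$), the expectation under $\pi$ collapses to the maximum, so the Expected Sarsa target equals the Q-learning target. The values below are made up purely for illustration:

```python
import numpy as np

Q_next = np.array([0.3, -0.1, 0.7, 0.2])       # hypothetical Q(S_{t+1}, .)
pi = np.zeros_like(Q_next)
pi[np.argmax(Q_next)] = 1.0                     # greedy target policy
assert np.isclose(pi @ Q_next, Q_next.max())    # expected target == max target
```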

One last question: what should the backup diagram of Expected Sarsa look like? Since computing the expectation requires considering all possible actions in the next state, we can start from the backup diagram of Q-learning and simply remove the arc across the action nodes; that completes the backup diagram of Expected Sarsa.
[Figure: backup diagram of Expected Sarsa]

Double Learning: a way to eliminate maximization bias

Maximization bias

Because the target policies of Q-learning and Sarsa are greedy or close to greedy (e.g., $\varepsilon$-greedy), a maximum over estimated values is implicitly used as an estimate of the maximum value, which leads to a significant positive bias called "maximization bias". This is easy to understand with the following example:

Consider a single state $s$ with many actions $a$ whose true values, $q(s, a)$, are all zero, but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is exactly zero, but the maximum of the estimates is clearly positive: a positive bias.

The maximization bias can harm the performance of TD control algorithms, misleading the agent into choosing a worse action just because its value estimate happens to be inflated.
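A small simulation of the single-state example above (the number of actions, noise scale, and sample count are arbitrary illustrative choices): every true value is zero, yet the maximum of the noisy estimates is clearly positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Each row is one set of noisy estimates Q(s, a) whose true values are all 0.
Q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(np.max(Q, axis=1).mean())   # ~1.5, not 0: the maximization bias
```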

Double Learning

Double learning is a clever way to remove the maximization bias. It divides the time steps into two sets and learns two independent estimates, $Q_1$ and $Q_2$: one is used to determine the maximizing action and the other to estimate its value. The resulting estimate is unbiased in the sense that $E[Q_1(A^*)] = q(A^*)$.
This can be written as $Q_1\big(\underset{a}{\operatorname{argmax}}\, Q_2(a)\big)$; of course, the roles of $Q_1$ and $Q_2$ can be reversed.

Double learning doubles the memory requirements, but does not increase the amount of computation per step.
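Continuing the simulation from the previous subsection, here is a sketch of why the double estimator is unbiased in that sense: $Q_2$ picks the maximizing action and an independent $Q_1$ evaluates it, so the average comes out near the true value of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_trials = 10, 100_000

# Two independent sets of noisy estimates for the same zero-valued actions.
Q1 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
Q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

a_star = np.argmax(Q2, axis=1)                  # Q2 determines A*
print(Q1[np.arange(n_trials), a_star].mean())   # ~0: Q1's estimate is unbiased
```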

If we apply double learning to Q-learning, we get Double Q-learning, and its update rule becomes:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \, Q_2\big(S_{t+1}, \underset{a}{\operatorname{argmax}}\, Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \Big]$$

Here, we use $Q_1$ to determine the maximizing action $A^*$ and $Q_2$ to estimate its value, and finally we update $Q_1$.

A complete algorithm for Double Q-learning is given below:

[Figure: pseudocode for the complete Double Q-learning algorithm]
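And here is a minimal sketch of one Double Q-learning step in code; the coin flip decides which table plays the "determine" role and which plays the "estimate" role on this step (table shapes and hyperparameters are illustrative assumptions):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99,
                    rng=np.random.default_rng()):
    """One Double Q-learning step; Q1 and Q2 are (n_states, n_actions) tables."""
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])             # Q1 determines A*
        target = r + gamma * Q2[s_next, a_star]    # Q2 estimates its value
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])             # roles reversed
        target = r + gamma * Q1[s_next, a_star]
        Q2[s, a] += alpha * (target - Q2[s, a])
```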
Of course, Sarsa and Expected Sarsa also have double learning versions. Here is the update equation for Double Expected Sarsa with an $\varepsilon$-greedy target policy (incrementing $Q_1$):
$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \bigg[ R_{t+1} + \gamma \Big( \frac{\varepsilon}{|\mathcal{A}(S_{t+1})|} \sum_{a} Q_2(S_{t+1}, a) + (1 - \varepsilon) \max_a Q_2(S_{t+1}, a) \Big) - Q_1(S_t, A_t) \bigg]$$
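A matching sketch of that update in code, computing the $\varepsilon$-greedy expectation from $Q_2$ and incrementing $Q_1$ ($\varepsilon$ and the table layout are illustrative assumptions):

```python
import numpy as np

def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next,
                                 alpha=0.1, gamma=0.99, epsilon=0.1):
    """Increment Q1 using the epsilon-greedy expectation computed from Q2."""
    n_actions = Q2.shape[1]
    expected_q = (epsilon / n_actions) * Q2[s_next].sum() \
                 + (1.0 - epsilon) * Q2[s_next].max()
    Q1[s, a] += alpha * (r + gamma * expected_q - Q1[s, a])
```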

