Q-learning, Expected Sarsa, and Double Learning

A brief introduction to Q-learning

Q-learning is defined as

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]$$

Here, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed, and that is why Q-learning is off-policy.

All that is required for correct convergence is that all state-action pairs continue to be updated.

So we do not need to worry about which behavior policy selects the state-action pairs that get visited and updated: as long as all pairs keep being updated (plus the usual stochastic-approximation conditions on the step sizes), $Q$ converges with probability 1 to $q_*$. The complete update process can be shown as:

[Figure: pseudocode for Q-learning (off-policy TD control)]
From this process we can also draw the backup diagram. Note that Q-learning maximizes over all possible actions in the next state, and we use an arc across the action nodes to indicate this maximization.

[Figure: backup diagram of Q-learning]
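To make the update concrete, here is a minimal sketch of one tabular Q-learning step in Python with NumPy. The table layout, step size, and discount below are illustrative assumptions, not part of the original post:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: bootstrap from max_a' Q(s', a').

    Q is assumed to be a NumPy array of shape (n_states, n_actions);
    alpha and gamma are illustrative defaults.
    """
    td_target = r + gamma * np.max(Q[s_next])   # greedy (off-policy) target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q
```

The behavior policy (for example $\varepsilon$-greedy) only decides which pairs $(s, a)$ get visited; the bracketed target always bootstraps from the greedy action, which is what makes the method off-policy.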

Expected Sarsa: an improved algorithm

Q-learning maximizes over all possible actions in the next state; what if we use the expected value under the target policy instead? Following this idea, we can change the Q-learning update into:

$$
\begin{aligned}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \, E_\pi[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \big] \\
&\leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]
\end{aligned}
$$

Given the next state, $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called Expected Sarsa.

Because Expected Sarsa computes an expectation over all actions in the next state, it eliminates the variance due to the random selection of $A_{t+1}$, but it increases the computational cost of each step.
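As a sketch of how that expectation can be computed for an $\varepsilon$-greedy target policy $\pi$ (the value of $\varepsilon$ and the table shape below are illustrative assumptions):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One tabular Expected Sarsa step with an epsilon-greedy target policy."""
    n_actions = Q.shape[1]
    # pi(a' | s_next): epsilon-greedy probabilities derived from Q
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    expected_q = probs @ Q[s_next]          # E_pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}]
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```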

In tasks with deterministic state transitions (e.g., Cliff Walking), Expected Sarsa performs better than Sarsa, because it remains effective over a much wider range of values of the step-size parameter $\alpha$, even $\alpha = 1$, whereas Sarsa only performs well in the long run at small values of $\alpha$. The figure below shows this clearly:

[Figure: performance of Sarsa and Expected Sarsa on the cliff-walking task as a function of the step-size $\alpha$]
It is worth noting that Expected Sarsa can also be an off-policy algorithm: for instance, suppose $\pi$ is the greedy policy while the behavior policy is more exploratory; then Expected Sarsa is exactly Q-learning.

In this sense, Expected Sarsa improves on Sarsa and at the same time generalizes Q-learning.
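A quick check of that reduction: with a greedy target policy ($\varepsilon = 0$), the expectation under $\pi$ collapses to the maximum, so the Expected Sarsa target equals the Q-learning target. The values below are made up purely for illustration:

```python
import numpy as np

Q_next = np.array([0.3, -0.1, 0.7, 0.2])       # hypothetical Q(S_{t+1}, .)
pi = np.zeros_like(Q_next)
pi[np.argmax(Q_next)] = 1.0                     # greedy target policy
assert np.isclose(pi @ Q_next, Q_next.max())    # expected target == max target
```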

One last question: what should the backup diagram of Expected Sarsa look like? Since computing the expectation requires considering all possible actions in the next state, we can start from the backup diagram of Q-learning and simply remove the arc across the action nodes; that completes the backup diagram of Expected Sarsa.
[Figure: backup diagram of Expected Sarsa]

Double Learning: a way to eliminate maximization bias

Maximization bias

Because the target policies of Q-learning and Sarsa are greedy or close to greedy (e.g., $\varepsilon$-greedy), a maximum over estimated values is implicitly used as an estimate of the maximum value, which leads to a significant positive bias called "maximization bias". This is easy to understand with the following example:

Consider a single state $s$ with many actions $a$ whose true values, $q(s, a)$, are all zero, but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is exactly zero, but the maximum of the estimates is clearly positive: a positive bias.

The maximization bias can harm the performance of TD control algorithms, misleading the agent into choosing a worse action just because its value estimate happens to be inflated.
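A small simulation of the single-state example above (the number of actions, noise scale, and sample count are arbitrary illustrative choices): every true value is zero, yet the maximum of the noisy estimates is clearly positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Each row is one set of noisy estimates Q(s, a) whose true values are all 0.
Q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(np.max(Q, axis=1).mean())   # ~1.5, not 0: the maximization bias
```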

Double Learning

Double learning is a clever way to remove the maximization bias. It divides the time steps into two sets and learns two independent estimates, $Q_1$ and $Q_2$: one is used to determine the maximizing action and the other to estimate its value. The resulting estimate is unbiased in the sense that $E[Q_1(A^*)] = q(A^*)$.
This can be written as $Q_1\big(\underset{a}{\operatorname{argmax}}\, Q_2(a)\big)$; of course, the roles of $Q_1$ and $Q_2$ can be reversed.

Double learning doubles the memory requirements, but does not increase the amount of computation per step.
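Continuing the simulation from the previous subsection, here is a sketch of why the double estimator is unbiased in that sense: $Q_2$ picks the maximizing action and an independent $Q_1$ evaluates it, so the average comes out near the true value of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_trials = 10, 100_000

# Two independent sets of noisy estimates for the same zero-valued actions.
Q1 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
Q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

a_star = np.argmax(Q2, axis=1)                  # Q2 determines A*
print(Q1[np.arange(n_trials), a_star].mean())   # ~0: Q1's estimate is unbiased
```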

If we apply double learning to Q-learning, we get Double Q-learning, and its update rule becomes:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \, Q_2\big(S_{t+1}, \underset{a}{\operatorname{argmax}}\, Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \Big]$$

Here, we use $Q_1$ to determine the maximizing action $A^*$ and $Q_2$ to estimate its value, and finally we update $Q_1$.

A complete algorithm for Double Q-learning is given below:

[Figure: pseudocode for the complete Double Q-learning algorithm]
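And here is a minimal sketch of one Double Q-learning step in code; the coin flip decides which table plays the "determine" role and which plays the "estimate" role on this step (table shapes and hyperparameters are illustrative assumptions):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99,
                    rng=np.random.default_rng()):
    """One Double Q-learning step; Q1 and Q2 are (n_states, n_actions) tables."""
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])             # Q1 determines A*
        target = r + gamma * Q2[s_next, a_star]    # Q2 estimates its value
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])             # roles reversed
        target = r + gamma * Q1[s_next, a_star]
        Q2[s, a] += alpha * (target - Q2[s, a])
```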
Of course, Sarsa and Expected Sarsa also have double learning versions. Here is the update equation for Double Expected Sarsa with an $\varepsilon$-greedy target policy (incrementing $Q_1$):
$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \bigg[ R_{t+1} + \gamma \Big( \frac{\varepsilon}{|\mathcal{A}(S_{t+1})|} \sum_{a} Q_2(S_{t+1}, a) + (1 - \varepsilon) \max_a Q_2(S_{t+1}, a) \Big) - Q_1(S_t, A_t) \bigg]$$
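A matching sketch of that update in code, computing the $\varepsilon$-greedy expectation from $Q_2$ and incrementing $Q_1$ ($\varepsilon$ and the table layout are illustrative assumptions):

```python
import numpy as np

def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next,
                                 alpha=0.1, gamma=0.99, epsilon=0.1):
    """Increment Q1 using the epsilon-greedy expectation computed from Q2."""
    n_actions = Q2.shape[1]
    expected_q = (epsilon / n_actions) * Q2[s_next].sum() \
                 + (1.0 - epsilon) * Q2[s_next].max()
    Q1[s, a] += alpha * (r + gamma * expected_q - Q1[s, a])
```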

