Sarsa: One of the Classical Algorithms of RL

What is TD learning?

“TD learning” stands for “temporal-difference learning”, a combination of Monte Carlo (MC) ideas and dynamic programming (DP) ideas. Here we mainly focus on the differences between TD and MC. In MC methods the update target is the actual return, so an update can only occur once an episode is over; TD methods, in contrast, can form their update target as soon as the next time step ends, which makes them much more direct. To show the difference more concretely, the update rules are given below:
$$\mathrm{MC}: \quad V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$

$$\mathrm{TD\ methods}: \quad V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
From the update rules above, the target of the TD update is based entirely on the next time step, so we call it TD(0), or one-step TD, because it is a special case of TD($\lambda$).
Here is the pseudo-code of TD(0):

[Figure: pseudo-code of tabular TD(0)]
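In case the image does not display, below is a minimal sketch of tabular TD(0) policy evaluation in Python. The environment interface (`env.reset()` / `env.step(action)` returning `(next_state, reward, done)`) and the fixed `policy(state)` function are illustrative assumptions, not part of the original text.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (a sketch, assuming a gym-like env API)."""
    V = collections.defaultdict(float)  # V(s) initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                      # action chosen by the policy being evaluated
            next_state, reward, done = env.step(action)
            # TD(0) target: immediate reward plus bootstrapped value of the next state
            # (the value of a terminal state is taken to be zero)
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])     # V(S_t) <- V(S_t) + alpha [target - V(S_t)]
            state = next_state
    return V
```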

On-policy and Off-policy

These two approaches were originally proposed to solve the problem of ensuring that all actions are selected infinitely often in MC methods.

The main difference between the two is whether the policy used to generate the data is the same as the policy being evaluated and improved for actual decision making. With reference to the every-visit MC method suited to non-stationary environments, the process of updating an existing value estimate can be written as:
$$value \leftarrow value + \alpha \, (target\ value - value)$$
Here, the “target value” and the “value” in parentheses correspond to the policy being evaluated and the policy that generates the data, respectively. If the two are the same policy, the algorithm is on-policy; otherwise, it is off-policy.

Theoretically speaking, on-policy algorithms can only be trained on fresh data generated by the policy currently being optimized. Once the policy has been updated with a batch of data, the “currently optimizing” policy changes immediately, and the new policy is then used to generate the next batch of data, which trains the next step and only the next step. With the data-generating policy constantly shifting, an on-policy algorithm eventually achieves the same effect as choosing the best action step by step; however, this “best” may be only locally optimal rather than globally optimal.

In contrast, the policy used to generate samples in off-policy algorithms is not the same as the one being evaluated. The evaluated policy is usually a fixed target, say “the best” one, whereas the policy actually used to generate the data cannot be the best one all the time. From this, we can conclude that learning from past experience is essentially off-policy.

Here is the explanation from the textbook [1]:

On-policy methods attempt to evaluate or improve a policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
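To make the distinction concrete, here is a minimal sketch (not from the textbook) of how the “target value” in the generic update above differs in the two settings. The nested-dictionary Q table and the greedy target for the off-policy case are illustrative assumptions.

```python
def on_policy_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap from the action the behaviour policy actually took,
    # so the policy that generates the data is also the one being evaluated.
    return reward + gamma * Q[next_state][next_action]

def off_policy_target(Q, reward, next_state, gamma=0.99):
    # Off-policy (illustrative greedy target): bootstrap from the action the
    # evaluated policy would take, regardless of what the behaviour policy did.
    return reward + gamma * max(Q[next_state].values())

# Either target is then plugged into: value <- value + alpha * (target - value)
```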

A brief introduction to Sarsa

Sarsa is an on-policy TD control method, which views an episode as an alternating sequence of states and state-action pairs. Note that the unit of transition we consider here is the state-action pair rather than the state alone. The rule for updating action values can be formulated as:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
The update occurs after every transition from a non-terminal state $S_t$. If $S_{t+1}$ is a terminal state, $Q(S_{t+1}, A_{t+1})$ is defined to be zero.

In brief, the core of the rule introduced above is a quintuple, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which describes a transition from one state-action pair to the next and gives the algorithm its name. Its backup diagram is as follows:
[Backup diagram: the black dot represents a state and the circle represents a state-action pair]
The on-policy nature of Sarsa lies in the fact that, while estimating $q_{\pi}$, the action $A_{t+1}$ is chosen by the policy $\pi$ itself, so $Q(S_{t+1}, A_{t+1})$ in the target is also the Q-value under the same policy $\pi$ that is followed throughout the process.

The complete process of the Sarsa algorithm can be described by the following pseudo-code:

[Figure: pseudo-code of Sarsa (on-policy TD control)]
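In case the image does not display, here is a compact Python sketch of the Sarsa loop. The $\varepsilon$-greedy helper, the `actions` list, and the gym-like environment interface (as in the TD(0) sketch above) are assumptions made for illustration.

```python
import random
import collections

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (Sarsa); assumes env.reset()/env.step() as in the TD(0) sketch."""
    Q = collections.defaultdict(float)                  # Q(s, a) initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Q(S_t, A_t) <- Q(S_t, A_t) + alpha [R_{t+1} + gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
            # The Q-value of a terminal state is taken to be zero.
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Note how the same $\varepsilon$-greedy policy both selects `next_action` for interaction and supplies $Q(S_{t+1}, A_{t+1})$ in the target, which is exactly the on-policy property discussed above.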

References

[1] Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction.
[2] https://zhuanlan.zhihu.com/p/59792208

If there is any infringement, it will be deleted immediately.
