Sarsa(λ) and Q(λ) in Tabular Case

Thanks to R. S. Sutton and A. G. Barto for their great work in Reinforcement Learning: An Introduction.

Eligibility Traces in Prediction Problems

In the backward view of TD($\lambda$), there is a memory variable associated with each state, its eligibility trace. The eligibility trace for state $s$ at time $t$ is a random variable denoted $Z_t(s) \in \mathbb{R}^+$. On each step, the eligibility traces for all states decay by $\gamma\lambda$, and the eligibility trace for the one state visited on the step is incremented by 1:

$$Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ \gamma\lambda Z_{t-1}(s) + 1 & \text{if } s = S_t \end{cases}$$

for all nonterminal states $s$, where $\gamma$ is the discount rate and $\lambda$ is the $\lambda$-return$^1$ parameter, or trace-decay parameter. This kind of eligibility trace is called an accumulating trace. The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:

$$\Delta V_t(s) = \alpha \delta_t Z_t(s), \quad \text{for all } s \in \mathcal{S}$$

where
$$\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$$

A complete prediction algorithm for on-line TD(λ) is given as follows:

  1. Initialize $V(s)$ arbitrarily.
  2. Repeat for each episode:
    1. Initialize $Z(s) = 0$ for all $s \in \mathcal{S}$.
    2. Initialize $S$.
    3. Repeat for each step of episode:
      1. $A \leftarrow$ action given by $\pi$ for $S$.
      2. Take action $A$, observe reward $R$ and the next state $S'$.
      3. $\delta \leftarrow R + \gamma V(S') - V(S)$
      4. $Z(S) \leftarrow Z(S) + 1$
      5. For all $s \in \mathcal{S}$:
        1. $V(s) \leftarrow V(s) + \alpha\delta Z(s)$
        2. $Z(s) \leftarrow \gamma\lambda Z(s)$
      6. $S \leftarrow S'$
    4. Until $S$ is terminal.
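
Below is a minimal Python sketch of this prediction procedure. The `env.reset()` / `env.step(a)` interface and the `policy` callable are illustrative assumptions, not part of the book's pseudocode:

```python
import numpy as np

def td_lambda(env, policy, n_states, alpha=0.1, gamma=0.99, lam=0.9, episodes=500):
    """Online TD(lambda) prediction with accumulating traces (tabular case).

    Assumes a gym-like interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done); policy(s) returns an action.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        Z = np.zeros(n_states)           # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD error; terminal states have value 0
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            Z[s] += 1.0                  # accumulating trace for the visited state
            V += alpha * delta * Z       # proportional update of all traced states
            Z *= gamma * lam             # decay all traces
            s = s_next
    return V
```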

Sarsa(λ)

The main idea of control is simply to learn action values rather than state values. The idea in Sarsa($\lambda$) is to apply the TD($\lambda$) prediction method to state-action pairs. Let $Z_t(s,a)$ denote the trace for state-action pair $(s,a)$. Substitute state-action variables for state variables, i.e.

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s, a$$

where
$$\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$$

and
$$Z_t(s,a) = \begin{cases} \gamma\lambda Z_{t-1}(s,a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{otherwise} \end{cases} \quad \text{for all } s, a$$

The complete Sarsa($\lambda$) algorithm is given as follows:

  1. Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  2. Repeat for each episode:
    1. $Z(s,a) = 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
    2. Initialize $S$, $A$.
    3. Repeat for each step of episode:
      1. Take action $A$, observe $R$, $S'$.
      2. Choose $A'$ from $S'$ using the policy derived from $Q$ (e.g., $\varepsilon$-greedy).
      3. $\delta \leftarrow R + \gamma Q(S', A') - Q(S, A)$
      4. $Z(S, A) \leftarrow Z(S, A) + 1$
      5. For all $s \in \mathcal{S}, a \in \mathcal{A}(s)$:
        1. $Q(s,a) \leftarrow Q(s,a) + \alpha\delta Z(s,a)$
        2. $Z(s,a) \leftarrow \gamma\lambda Z(s,a)$
      6. $S \leftarrow S'$; $A \leftarrow A'$
    4. Until $S$ is terminal.
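
A corresponding Python sketch of tabular Sarsa($\lambda$) with an $\varepsilon$-greedy behavior policy, again assuming a hypothetical gym-like `env` interface:

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.99, lam=0.9,
                 epsilon=0.1, episodes=500):
    """Tabular Sarsa(lambda) with accumulating traces.

    Assumes a gym-like env: reset() -> state, step(a) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        Z = np.zeros((n_states, n_actions))   # traces reset at episode start
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # Sarsa TD error uses the action actually selected at S'
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            Z[s, a] += 1.0               # accumulating trace
            Q += alpha * delta * Z       # update all state-action pairs
            Z *= gamma * lam             # decay all traces
            s, a = s_next, a_next
    return Q
```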

Q(λ)

Two different methods have been proposed that combine eligibility traces and Q-learning: Watkins's Q($\lambda$) and Peng's Q($\lambda$). Here we focus on Watkins's Q($\lambda$) and give only a brief description of Peng's Q($\lambda$).

Unlike TD($\lambda$) and Sarsa($\lambda$), Watkins's Q($\lambda$) does not look ahead all the way to the end of the episode in its backup. It only looks ahead as far as the next exploratory action. For example, suppose the first action, $A_{t+1}$, is exploratory. Watkins's Q($\lambda$) would still do the one-step update of $Q_t(S_t, A_t)$ toward $R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a)$. In general, if $A_{t+n}$ is the first exploratory action, then the longest backup is toward

$$R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \max_a Q_t(S_{t+n}, a)$$

where we assume off-line updating.

The mechanistic or backward view of Watkins's Q($\lambda$) is also very simple. Eligibility traces are used just as in Sarsa($\lambda$), except that they are set to zero whenever an exploratory action is taken. The trace update works in two steps. First, the traces for all state-action pairs are either decayed by $\gamma\lambda$ or, if an exploratory action was taken, set to 0. Second, the trace corresponding to the current state and action is incremented by 1, i.e.

$$Z_t(s,a) = I_{sS_t} \cdot I_{aA_t} + \begin{cases} \gamma\lambda Z_{t-1}(s,a) & \text{if } Q_{t-1}(S_t, A_t) = \max_a Q_{t-1}(S_t, a) \\ 0 & \text{otherwise} \end{cases}$$

where $I_{xy}$ is an identity indicator function, equal to 1 if $x = y$ and 0 otherwise. The rest of the algorithm is defined by

$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t Z_t(s,a), \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)$$

where
$$\delta_t = R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a) - Q_t(S_t, A_t)$$

The complete algorithm is given below:

  1. Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  2. Repeat for each episode:
    1. $Z(s,a) = 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
    2. Initialize $S$, $A$.
    3. Repeat for each step of episode:
      1. Take action $A$, observe $R$, $S'$.
      2. Choose $A'$ from $S'$ using the policy derived from $Q$ (e.g., $\varepsilon$-greedy).
      3. $A^* \leftarrow \arg\max_a Q(S', a)$ (if $A'$ ties for the max, then $A^* \leftarrow A'$).
      4. $\delta \leftarrow R + \gamma Q(S', A^*) - Q(S, A)$
      5. $Z(S, A) \leftarrow Z(S, A) + 1$
      6. For all $s \in \mathcal{S}, a \in \mathcal{A}(s)$:
        1. $Q(s,a) \leftarrow Q(s,a) + \alpha\delta Z(s,a)$
        2. If $A' = A^*$, then $Z(s,a) \leftarrow \gamma\lambda Z(s,a)$; else $Z(s,a) \leftarrow 0$.
      7. $S \leftarrow S'$; $A \leftarrow A'$
    4. Until $S$ is terminal.
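
A Python sketch of Watkins's Q($\lambda$), differing from the Sarsa($\lambda$) sketch only in the TD target and in cutting the traces after non-greedy actions (same hypothetical `env` interface as above):

```python
import numpy as np

def watkins_q_lambda(env, n_states, n_actions, alpha=0.1, gamma=0.99, lam=0.9,
                     epsilon=0.1, episodes=500):
    """Tabular Watkins's Q(lambda): traces are zeroed after exploratory actions.

    Assumes a gym-like env: reset() -> state, step(a) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        Z = np.zeros((n_states, n_actions))
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = int(np.argmax(Q[s_next]))       # greedy action at S'
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                      # A' ties for the max
            delta = r + (0.0 if done else gamma * Q[s_next, a_star]) - Q[s, a]
            Z[s, a] += 1.0
            Q += alpha * delta * Z
            if a_next == a_star:
                Z *= gamma * lam                     # greedy: decay traces as usual
            else:
                Z[:] = 0.0                           # exploratory: cut all traces
            s, a = s_next, a_next
    return Q
```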

Unfortunately, cutting off traces every time an exploratory action is taken loses much of the advantage of using eligibility traces. Peng's Q($\lambda$) is an alternative version of Q($\lambda$) meant to remedy this, which can be thought of as a hybrid of Sarsa($\lambda$) and Watkins's Q($\lambda$). In Peng's Q($\lambda$), there is no distinction between exploratory and greedy actions. Each component backup is over many steps of actual experience, and all but the last are capped by a final maximization over actions. For a fixed nongreedy policy, $Q_t$ converges to neither $q_\pi$ nor $q_*$, but to some hybrid of the two. However, if the policy is gradually made more greedy, then the method may still converge to $q_*$.

Replacing Traces

In some cases, significantly better performance can be obtained by using a slightly modified kind of trace known as a replacing trace:

$$Z_t(s) = \begin{cases} \gamma\lambda Z_{t-1}(s) & \text{if } s \neq S_t \\ 1 & \text{if } s = S_t \end{cases}$$

There are several possible ways to generalize replacing eligibility traces for use in control methods. In some cases, the state-action traces are updated by the following:
$$Z_t(s,a) = \begin{cases} 1 & \text{if } s = S_t \text{ and } a = A_t \\ 0 & \text{if } s = S_t \text{ and } a \neq A_t \\ \gamma\lambda Z_{t-1}(s,a) & \text{if } s \neq S_t \end{cases}$$
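
In code, the only change relative to the accumulating-trace sketches above is the trace increment. A minimal sketch of the two variants for the state-action case (the `replacing` flag and array shapes are illustrative assumptions, not from the book):

```python
import numpy as np

def update_trace(Z, s, a, replacing=True):
    """Apply one trace increment for the visited pair (s, a) on a trace
    matrix Z of shape (n_states, n_actions).

    With replacing=True this follows the replacing-trace rule above: the
    visited pair's trace becomes 1 and the other actions in state s are
    cleared. Otherwise the accumulating rule Z[s, a] += 1 is used. The
    gamma*lambda decay of all traces is applied later in the step, as in
    the algorithms above.
    """
    if replacing:
        Z[s, :] = 0.0     # clear traces of the other actions in the visited state
        Z[s, a] = 1.0     # replacing trace for the visited pair
    else:
        Z[s, a] += 1.0    # accumulating trace
    return Z

# Example: replacing-trace update on a 5-state, 2-action table.
Z = np.zeros((5, 2))
update_trace(Z, s=2, a=1)
```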


  1. $\lambda$-return: $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$