Model Free Control
Optimise the value function of an unknown MDP
On-Policy Monte-Carlo Control
Generalised Policy Iteration
Monte-Carlo Policy Iteration
ONE

- Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$? or $Q = q_\pi$?
- Policy improvement: Greedy policy improvement?

Greedy policy improvement over $V(s)$ requires a model of the MDP (model-based):

$$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\left(\mathcal{R}_s^a + \mathcal{P}_{ss'}^a V(s') \right)$$

Greedy policy improvement over $Q(s,a)$ is model-free:

$$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s,a)$$

so: $\Downarrow$
TWO

- Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
- Policy improvement: Greedy policy improvement?
At initialisation the value function usually assigns the same value everywhere, so if one action happens to look slightly better, a greedy policy will keep choosing it while never finding out how good the other actions are. This is the exploration problem.
$\epsilon$-Greedy Exploration: all $m$ actions are tried with non-zero probability. With probability $1-\epsilon$ choose the greedy action; with probability $\epsilon$ choose an action at random.

$$\pi(a \mid s) = \begin{cases} \epsilon/m + 1-\epsilon &\text{if}\ a^* = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s,a) \\ \epsilon/m &\text{otherwise} \end{cases}$$

Theorem: For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, $v_{\pi'}(s) \geq v_\pi(s)$.

Proof:
$$\begin{aligned} q_\pi(s,\pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a\mid s) q_\pi(s,a) \\ &= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon)\underset{a \in \mathcal{A}}{\operatorname{max}}\, q_\pi(s,a) \\ &= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon) \underset{a \in \mathcal{A}}{\operatorname{max}}\, q_\pi(s,a) \sum_{a\in \mathcal{A}}\frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1-\epsilon} \\ &\geq \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon)\sum_{a\in \mathcal{A}}\frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1-\epsilon} q_\pi(s,a) \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s)q_\pi(s,a) = v_\pi(s) \end{aligned}$$
Explanation of the third and fourth lines. In the third line, by the definition of $\epsilon$-greedy, the second sum evaluates to 1, so equality holds. In the fourth line, treat the summed fractions as weights on $q_\pi(s,a)$: a weighted average of the $q_\pi(s,a)$ terms can be at most their maximum. A possible misconception: one might think that $\frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1-\epsilon}$ equals 1 only for the action where $q_\pi(s,a)$ is maximal and 0 for every other action, which would make the fourth line an equality; if that were so, $\pi$ and $\pi'$ would be indistinguishable. Note that the action selected in $\underset{a \in \mathcal{A}}{\operatorname{max}}\, q_\pi(s,a)$ is the one $\pi'$ picks by maximising $q_\pi(s,a)$, and it is not necessarily the same action for which $\pi(a \mid s) = \epsilon/m + 1-\epsilon$. If the two coincide, equality holds.
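The $\epsilon$-greedy selection rule above can be sketched in Python. This is a minimal sketch; the dict keyed by (state, action) pairs and the function name are illustrative assumptions, not from the lecture.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Sample an action epsilon-greedily from an action-value table.

    Q is assumed to be a dict mapping (state, action) -> value."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform over all m actions
    # exploit: greedy action w.r.t. Q
    return max(actions, key=lambda a: Q[(state, a)])

# Usage: with epsilon = 0.1 the greedy action is chosen with overall
# probability 1 - epsilon + epsilon/m, matching the case formula above.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1)
```

Note that choosing uniformly among all actions with probability $\epsilon$ already includes the greedy action, which is how the $\epsilon/m$ term arises for every action.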
so: $\Downarrow$
THREE

- Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
- Policy improvement: $\epsilon$-Greedy policy improvement
Pseudocode
Monte-Carlo Control
Every episode:

- Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
- Policy improvement: $\epsilon$-Greedy policy improvement
Instead of waiting for many episodes to estimate Q, update it as soon as each episode finishes.
GLIE Monte-Carlo Control
Definition of GLIE:

Greedy in the Limit with Infinite Exploration (GLIE)

- All state-action pairs are explored infinitely many times:

$$\lim_{k\to \infty}N_k(s,a) = \infty$$

- The policy converges on a greedy policy:

$$\lim_{k\to \infty} \pi_k(a\mid s) = \mathbf{1}\left(a=\underset{a' \in \mathcal{A}}{\operatorname{arg\,max}}\,Q_k(s,a') \right)$$
Algorithm:

- Sample the $k$-th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode:

$$\begin{aligned} N(S_t,A_t) &\leftarrow N(S_t, A_t) + 1 \\ Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}(G_t - Q(S_t,A_t)) \end{aligned}$$

- Improve the policy based on the new action-value function:

$$\begin{aligned} \epsilon &\leftarrow 1/k \qquad \text{(gradually increases the probability of choosing the Q-maximising action)} \\ \pi &\leftarrow \epsilon\text{-greedy}(Q) \end{aligned}$$
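The loop above can be sketched in Python. This is a minimal sketch under an assumed gym-style environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`, a list `actions`); none of these names come from the lecture.

```python
import random
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch (env interface is an assumption)."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def eps_greedy(s, eps):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: epsilon <- 1/k
        # sample the k-th episode under the current eps-greedy policy
        episode = []
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s, eps)
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        # every-visit MC: update Q with running-mean step size 1/N
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        # policy improvement is implicit: eps_greedy reads the updated Q
    return Q
```

The incremental update `Q += (G - Q) / N` keeps a running mean of the returns, matching the $\frac{1}{N(S_t,A_t)}$ step size in the algorithm.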
On-Policy Temporal-Difference Learning
Natural idea: use TD instead of MC in our control loop
- Apply TD to $Q(S,A)$
- Use $\epsilon$-greedy policy improvement
- Update every time-step
Update Action-Value Functions with Sarsa
On-Policy Control With Sarsa
Every time-step:

- Policy evaluation: Sarsa, $Q \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
Sarsa Algorithm:
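A minimal tabular Sarsa sketch follows. The gym-style environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`, a list `actions`) is an assumption for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa sketch (env interface is an assumption)."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)                       # choose A from S
        done = False
        while not done:
            s2, r, done = env.step(a)
            if done:
                target = r                      # no bootstrap at terminal state
            else:
                a2 = eps_greedy(s2)             # choose A' from S' (on-policy)
                target = r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s2, a2
    return Q
```

The defining on-policy detail is that $A'$ is sampled from the same $\epsilon$-greedy policy that is being evaluated, and is then actually taken on the next step.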
Sarsa($\lambda$)
n-step Sarsa
- Consider the following n-step returns for $n = 1,2,\dots, \infty$:

$$\begin{aligned} n=1 \ \text{(Sarsa)} \qquad q_t^{(1)} &= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \\ n=2 \qquad\qquad\quad\ \ q_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2}) \\ &\ \ \vdots \\ n=\infty \ \text{(MC)} \qquad q_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1}R_T \end{aligned}$$
- Define the n-step Q-return:

$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$

- n-step Sarsa updates $Q(s,a)$ towards the n-step Q-return:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left(q_t^{(n)} - Q(S_t,A_t) \right)$$
Forward View Sarsa($\lambda$)

- Combines all n-step Q-returns $q_t^{(n)}$
- Using weight $(1-\lambda)\lambda^{n-1}$:

$$q_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}q_t^{(n)}$$

- Forward-view Sarsa($\lambda$) update:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left(q_t^\lambda - Q(S_t,A_t) \right)$$
Backward View Sarsa($\lambda$)

See the backward-view treatment of TD($\lambda$) in Lecture 4; the details are not repeated here.

- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair:

$$\begin{aligned} E_0(s,a) &= 0 \\ E_t(s,a) &= \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a) \end{aligned}$$

- $Q(s,a)$ is updated for every state $s$ and action $a$
- In proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s,a)$:

$$\begin{aligned} \delta_t &= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\ Q(s,a) &\leftarrow Q(s,a) + \alpha \delta_t E_t(s,a) \end{aligned}$$
Sarsa($\lambda$) Algorithm
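A backward-view Sarsa($\lambda$) sketch with accumulating eligibility traces. The gym-style environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`, a list `actions`) is an assumption for illustration.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, num_episodes, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Backward-view Sarsa(lambda) sketch (env interface is an assumption)."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)           # one trace per (state, action) pair
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            q_next = Q[(s2, a2)] if not done else 0.0
            delta = r + gamma * q_next - Q[(s, a)]   # TD error delta_t
            E[(s, a)] += 1.0                         # bump trace for (S_t, A_t)
            for key in E:                            # update EVERY visited pair
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam                # decay all traces
            s, a = s2, a2
    return Q
```

Unlike one-step Sarsa, a single TD error is broadcast to all recently visited state-action pairs, weighted by their decaying traces.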
Off-Policy Learning
Importance Sampling
Estimate the expectation of a different distribution:
$$\begin{aligned} \mathbb{E}_{X \sim P}[f(X)] &= \sum P(X)f(X) \\ &= \sum Q(X) \frac{P(X)}{Q(X)}f(X) \\ &= \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)}f(X) \right] \end{aligned}$$
Importance Sampling for Off-Policy Monte-Carlo
- Use returns generated from $\mu$ to evaluate $\pi$
- Weight the return $G_t$ according to the similarity between the policies, multiplying importance sampling corrections along the whole episode:

$$G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)} G_t$$

- Update the value towards the corrected return:

$$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}G_t^{\pi/\mu}} - V(S_t) \right)$$

- Importance sampling can dramatically increase variance
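The corrected return is a product of per-step probability ratios times the plain discounted return. A minimal sketch, assuming `pi` and `mu` are callables returning $\pi(a \mid s)$ and $\mu(a \mid s)$; all names are illustrative.

```python
def is_corrected_return(episode, pi, mu, gamma=1.0):
    """Importance-sampling-corrected MC return G^{pi/mu} from time 0.

    `episode` is a list of (state, action, reward) triples collected
    under the behaviour policy mu (an assumed representation).
    """
    G, rho = 0.0, 1.0
    for t, (s, a, r) in enumerate(episode):
        rho *= pi(a, s) / mu(a, s)   # multiply corrections along the episode
        G += gamma ** t * r          # plain discounted return
    return rho * G                   # weighted return
```

Because `rho` multiplies one ratio per step, a long episode with even modestly mismatched policies produces an extreme weight, which is why the variance blows up.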
Importance Sampling for Off-Policy TD

- Use TD targets generated from $\mu$ to evaluate $\pi$
- Weight the TD target $R + \gamma V(S')$ by importance sampling; only a single importance sampling correction is needed:

$$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}{\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\left(R_{t+1} + \gamma V(S_{t+1}) \right)}} - V(S_t) \right)$$

- Much lower variance than Monte-Carlo importance sampling
- Policies only need to be similar over a single step
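One such update can be written as a single function. This is a sketch: `V` is assumed to be a dict of state values, and `pi`/`mu` are assumed callables returning action probabilities; none of these names come from the lecture.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update with a single IS correction."""
    rho = pi(a, s) / mu(a, s)               # one-step correction only
    target = rho * (r + gamma * V[s_next])  # weighted TD target
    V[s] += alpha * (target - V[s])
    return V
```

In contrast to the Monte-Carlo case, only the single ratio for the current step appears, which keeps the variance bounded.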
Off-Policy Q-Learning
In state $S_t$, select an action according to the behaviour policy: $A_t \sim \mu(\cdot \mid S_t)$, receive the reward $R_{t+1}$, and arrive in state $S_{t+1}$. There, select an action according to the estimate (target) policy: $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$, and denote this action $A'$.

Update: $$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left({\color{red}{R_{t+1} + \gamma Q(S_{t+1},A')}}- Q(S_t,A_t) \right)$$
Special Case
- The target policy $\pi$ is greedy w.r.t. $Q(s,a)$:

$$\pi(S_{t+1}) = \underset{a'}{\operatorname{arg\,max}}\, Q(S_{t+1},a')$$

- The behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$

The Q-learning target then simplifies:
$$\begin{aligned} R_{t+1} + \gamma Q(S_{t+1}, A') &= R_{t+1} + \gamma Q(S_{t+1}, \underset{a'}{\operatorname{arg\,max}}\, Q(S_{t+1},a')) \\ &= R_{t+1} + \underset{a'}{\operatorname{max}}\, \gamma Q(S_{t+1},a') \end{aligned}$$
Algorithm:
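A tabular Q-learning sketch of this special case: behave $\epsilon$-greedily ($\mu$) and bootstrap from the greedy action ($\pi$), i.e. the max over $a'$. The gym-style environment interface (`reset() -> state`, `step(a) -> (state, reward, done)`, a list `actions`) is an assumption for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning sketch (env interface is an assumption)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy mu: epsilon-greedy w.r.t. Q
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # target policy pi is greedy, so the target uses max over a'
            q_max = 0.0 if done else max(Q[(s2, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * q_max - Q[(s, a)])
            s = s2
    return Q
```

Note no $A'$ is ever executed: the max over $a'$ is computed inside the target, which is exactly what makes the algorithm off-policy.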