Planning by Dynamic Programming
Introduction
- Dynamic: a sequential or temporal component to the problem
- Programming: optimising a "program", i.e. a policy
Requirements for DP
- Optimal substructure: the problem can be decomposed into subproblems, and the optimal solutions of the subproblems can be combined to obtain the optimal solution of the overall problem
- Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused
MDPs satisfy both requirements:
- Bellman equation gives recursive decomposition.
- Value function stores and reuses solutions
DP is used for planning in an MDP:
- Prediction: given the MDP and a policy $\pi$, output the value function $\operatorname{v}_\pi$
- Control: given the MDP, output the optimal value function $\operatorname{v}_*$ and an optimal policy $\pi_*$
Policy Evaluation
Iterative Policy Evaluation
Iterative application of the $\text{Bellman {\color{red}expectation} backup}$: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_\pi$
- Using synchronous backups:
  - at each iteration $k+1$
  - for all states $s \in \mathcal{S}$
  - update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$, where $s'$ is a successor state of $s$
- Convergence to $\operatorname{v}_\pi$ can be proved
How to update:
$$
\operatorname{v}_{\color{red}{k+1}}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}k}(s') \right)
$$
This is reminiscent of fixed-point iteration in numerical analysis for solving $x = f(x)$: pick an initial guess $x_0$, then loop $x_{k+1} = f(x_k)$.
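For instance, a tiny Python illustration with $f(x) = \cos(x)$, a standard textbook fixed-point example chosen here purely as an illustration:

```python
import math

x = 1.0                  # initial guess x_0
for k in range(50):      # loop x_{k+1} = f(x_k)
    x = math.cos(x)
print(x)                 # converges to the fixed point of cos, ~0.739085
```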
Matrix form:
$$
\mathbf{v}_{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi \mathbf{v}_k
$$
where, elementwise,
$$
\begin{aligned}
\mathcal{R}^\pi_s &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}_s^a \\
\mathcal{P}^\pi_{ss'} &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}_{ss'}^a
\end{aligned}
$$
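To make the matrix-form backup concrete, here is a minimal NumPy sketch; the names `R_pi` and `P_pi` are hypothetical inputs assumed to already contain $\mathcal{R}^\pi$ (a vector) and $\mathcal{P}^\pi$ (a matrix):

```python
import numpy as np

def iterative_policy_evaluation(R_pi, P_pi, gamma=1.0, tol=1e-8):
    """Synchronous backups v_{k+1} = R^pi + gamma * P^pi v_k until convergence.
    Assumes gamma < 1 or a proper (terminating) MDP so the iteration converges."""
    v = np.zeros(len(R_pi))               # v_0(s) = 0 for every state
    while True:
        v_next = R_pi + gamma * P_pi @ v  # one full sweep over all states
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
```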
Example
- $\gamma = 1$
- Nonterminal states 1, …, 14; one terminal state (shown twice as shaded squares)
- Actions leading out of the grid leave the state unchanged, e.g. when $s = 4, a = \text{west}$, the next state is $s' = 4$
- Transitions are deterministic given the action, e.g. $\mathcal{P}_{62}^{\text{north}} = \mathcal{P} \left[s' = 2 \mid s=6, a=\text{north}\right] = 1$
- Reward is $-1$ until the terminal state is reached
- Uniform random policy: $\pi(n \mid \cdot) = \pi(e \mid \cdot) = \pi(w \mid \cdot) = \pi(s \mid \cdot) = 0.25$
Initialise the value function of every state to 0 and keep iterating. The computation proceeds as follows:
For k = 0:
$$
\operatorname{v}_0(s) = 0 \qquad \forall s
$$
For k = 1, e.g.:
$$
\begin{aligned}
\operatorname{v}_1(s=4) &= \pi(a=n \mid s=4)*\left(R_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_0(s'=\text{terminal}) \right) \\
&\quad + \pi(a=w \mid s=4)*\left(R_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w} \operatorname{v}_0(s'=4) \right) \\
&\quad + \pi(a=s \mid s=4)*\left(R_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_0(s'=8) \right) \\
&\quad + \pi(a=e \mid s=4)*\left(R_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_0(s'=5) \right) \\
&= 0.25*(-1+0)+0.25*(-1+0)+0.25*(-1+0)+0.25*(-1+0) = -1.0
\end{aligned}
$$
For k = 2, e.g.:
$$
\begin{aligned}
\operatorname{v}_2(s=4) &= \pi(a=n \mid s=4)*\left(R_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_1(s'=\text{terminal}) \right) \\
&\quad + \pi(a=w \mid s=4)*\left(R_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w} \operatorname{v}_1(s'=4) \right) \\
&\quad + \pi(a=s \mid s=4)*\left(R_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_1(s'=8) \right) \\
&\quad + \pi(a=e \mid s=4)*\left(R_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_1(s'=5) \right) \\
&= 0.25*(-1+0)+0.25*(-1-1)+0.25*(-1-1)+0.25*(-1-1) = -1.75
\end{aligned}
$$
$$
\begin{aligned}
\operatorname{v}_2(s=8) &= \pi(a=n \mid s=8)*\left(R_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_1(s'=4) \right) \\
&\quad + \pi(a=w \mid s=8)*\left(R_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w} \operatorname{v}_1(s'=8) \right) \\
&\quad + \pi(a=s \mid s=8)*\left(R_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_1(s'=12) \right) \\
&\quad + \pi(a=e \mid s=8)*\left(R_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_1(s'=9) \right) \\
&= 0.25*(-1-1)+0.25*(-1-1)+0.25*(-1-1)+0.25*(-1-1) = -2
\end{aligned}
$$
For k = 3, e.g.:
$$
\begin{aligned}
\operatorname{v}_3(s=4) &= \pi(a=n \mid s=4)*\left(R_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_2(s'=\text{terminal}) \right) \\
&\quad + \pi(a=w \mid s=4)*\left(R_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w} \operatorname{v}_2(s'=4) \right) \\
&\quad + \pi(a=s \mid s=4)*\left(R_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_2(s'=8) \right) \\
&\quad + \pi(a=e \mid s=4)*\left(R_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_2(s'=5) \right) \\
&= 0.25*(-1+0)+0.25*(-1-1.75)+0.25*(-1-2)+0.25*(-1-2) = -2.4375
\end{aligned}
$$
$$
\begin{aligned}
\operatorname{v}_3(s=8) &= \pi(a=n \mid s=8)*\left(R_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_2(s'=4) \right) \\
&\quad + \pi(a=w \mid s=8)*\left(R_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w} \operatorname{v}_2(s'=8) \right) \\
&\quad + \pi(a=s \mid s=8)*\left(R_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_2(s'=12) \right) \\
&\quad + \pi(a=e \mid s=8)*\left(R_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_2(s'=9) \right) \\
&= 0.25*(-1-1.75)+0.25*(-1-2)+0.25*(-1-2)+0.25*(-1-2) = -2.9375
\end{aligned}
$$
$$
\begin{aligned}
\operatorname{v}_3(s=12) &= \pi(a=n \mid s=12)*\left(R_{s=12}^{a=n} + \mathcal{P}_{s=12,s'=8}^{a=n}\operatorname{v}_2(s'=8) \right) \\
&\quad + \pi(a=w \mid s=12)*\left(R_{s=12}^{a=w} + \mathcal{P}_{s=12,s'=12}^{a=w} \operatorname{v}_2(s'=12) \right) \\
&\quad + \pi(a=s \mid s=12)*\left(R_{s=12}^{a=s} + \mathcal{P}_{s=12,s'=12}^{a=s}\operatorname{v}_2(s'=12) \right) \\
&\quad + \pi(a=e \mid s=12)*\left(R_{s=12}^{a=e} + \mathcal{P}_{s=12,s'=13}^{a=e}\operatorname{v}_2(s'=13) \right) \\
&= 0.25*(-1-2)+0.25*(-1-2)+0.25*(-1-2)+0.25*(-1-2) = -3.0
\end{aligned}
$$
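To sanity-check these hand calculations, here is a small Python sketch of the same synchronous sweeps. It numbers the 16 cells 0–15 row by row, with the two terminal corners as states 0 and 15; this numbering is an assumption made for the sketch, but it matches the transitions used above (e.g. state 4 going north reaches the terminal state, state 12 going east reaches state 13):

```python
import numpy as np

N, TERMINAL = 4, (0, 15)                      # 4x4 grid, terminal corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east

def step(s, a):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

v = np.zeros(N * N)                           # v_0(s) = 0 for all s
for k in range(3):                            # three synchronous sweeps: v_1, v_2, v_3
    v_new = np.zeros_like(v)
    for s in range(N * N):
        if s not in TERMINAL:                 # terminal states keep value 0
            v_new[s] = sum(0.25 * (-1 + v[step(s, a)]) for a in ACTIONS)
    v = v_new

print(v[4], v[8], v[12])                      # -2.4375 -2.9375 -3.0
```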
Policy Iteration
Algorithm:

- Given a policy $\pi$
- Loop, alternating the two steps below, until the policy stops improving (a code sketch follows this list):
  - Evaluate the policy $\pi$: $\operatorname{v}_\pi(s) = \mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+ \ldots \mid S_t =s \right]$
  - Improve the policy by acting greedily with respect to $\operatorname{v}_\pi$: $\pi' = \operatorname{greedy}(\operatorname{v}_\pi)$
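As an illustration, here is a minimal Python sketch of this evaluate-then-improve loop for a finite MDP; the tabular inputs `P[s][a]` (a list of `(prob, next_state)` pairs) and `R[s][a]` (expected immediate reward) are hypothetical names chosen for the sketch:

```python
import numpy as np

def policy_iteration(P, R, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(n_states, dtype=int)    # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the current policy
        v = np.zeros(n_states)
        while True:
            v_new = np.array([R[s][policy[s]] +
                              gamma * sum(p * v[s2] for p, s2 in P[s][policy[s]])
                              for s in range(n_states)])
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        # Policy improvement: act greedily with respect to v_pi
        q = np.array([[R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                       for a in range(n_actions)] for s in range(n_states)])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # stable policy => no further improvement
            return policy, v
        policy = new_policy
```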
Policy improvement
Proof that $\pi' \geq \pi$ when acting greedily:
- Consider a deterministic policy, $a = \pi(s)$
- Improve the policy by acting greedily: $\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_\pi(s,a)$
- This improves the value from any state $s$ over one step:
  $$
  q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}\geq} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \ s
  $$
- It therefore improves the value function, $\operatorname{v}_{\pi'}(s) \geq \operatorname{v}_\pi(s)$:
  $$
  \begin{aligned} \operatorname{v}_\pi(s) &\leq q_\pi\left(s,\pi'(s) \right) = \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma \operatorname{v}_{\color{blue}\pi} \left(S_{t+1} \right) \mid S_t = s \right] \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma q_{\color{blue}\pi} \left(S_{t+1}, \pi'(S_{t+1}) \right) \mid S_t = s \right] \qquad \text{by the previous step,}\ \forall \ s \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\color{blue}\pi} \left(S_{t+2}, \pi'(S_{t+2}) \right) \mid S_t = s \right] \\ &\leq \dots \leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = \operatorname{v}_{\pi'}(s) \end{aligned}
  $$
Why $\color{red}\pi'$ and $\color{blue}\pi$? To see this, look at how the expectation over the policy is defined:
$$
\mathbb{E}_{\color{red}\pi'}\left[R_{t+1} + \gamma \operatorname{v}_{\color{blue}\pi}(S_{t+1}) \mid S_t = s\right] = \sum_{a \in \mathcal{A}} {\color{red}\pi'(a \mid s)} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{\color{blue}\pi}(s')\right)
$$
The weights of the expectation are the action probabilities of the policy that is actually followed. After the greedy step, the probabilities with which actions are selected in each state come from $\pi'$, while the state-value function being backed up is still $\operatorname{v}_\pi$, computed under $\pi$ and not yet re-evaluated. Hence the distinction between $\color{red}\pi'$ and $\color{blue}\pi$.
Proof of convergence to $\pi^*$:
- If improvements stop,
  $$
  q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}=} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \ s
  $$
- then the Bellman optimality equation is satisfied:
  $$
  \operatorname{v}_\pi(s) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a)
  $$
- Therefore $\operatorname{v}_\pi(s) = \operatorname{v}_*(s) \quad \forall s \in \mathcal{S}$
- So $\pi$ is an optimal policy
Value Iteration
Principle of Optimality
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $\operatorname{v}_\pi(s) = \operatorname{v}_*(s)$, if and only if
- for any state $s'$ reachable from $s$,
- $\pi$ achieves the optimal value from state $s'$, i.e. $\operatorname{v}_\pi(s') = \operatorname{v}_*(s')$
Deterministic Value Iteration
Compared with iterative policy evaluation above, the only difference is the update rule: a max over actions instead of an expectation over the policy.
- If we know the solution to the subproblems $\operatorname{v}_*(s')$
- Then the solution $\operatorname{v}_*(s)$ can be found by a one-step lookahead:
  $$
  \operatorname{v}_*(s) \leftarrow \underset{a \in \mathcal{A}}{\operatorname{max}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_*(s') \right)
  $$
- The idea of value iteration is to apply these updates iteratively
- Intuition: start with the final rewards and work backwards
Iterative application of the $\text{Bellman {\color{red}optimality} backup}$: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_*$
- Using synchronous backups:
  - at each iteration $k+1$
  - for all states $s \in \mathcal{S}$
  - update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$
- Intermediate value functions may not correspond to any policy
How to update:
$$
\operatorname{v}_{\color{red}{k+1}}(s) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{\color{red}k}(s') \right)
$$
Matrix form:
$$
\mathbf{v}_{k+1} = \underset{a \in \mathcal{A}}{\operatorname{max}}\left( \mathcal{R}^{\mathbf{a}} + \gamma \mathcal{P}^{\mathbf{a}} \mathbf{v}_k \right)
$$
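A minimal Python sketch of this backup, using the same hypothetical tabular inputs as the policy-iteration sketch above (`P[s][a]` as a list of `(prob, next_state)` pairs, `R[s][a]` as the expected immediate reward):

```python
import numpy as np

def value_iteration(P, R, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until v converges,
    then read off a greedy policy from the final action values."""
    v = np.zeros(n_states)
    while True:
        q = np.array([[R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                       for a in range(n_actions)] for s in range(n_states)])
        v_new = q.max(axis=1)                 # max over actions, not an expectation
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```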
A live demo
GridWorld: Dynamic Programming Demo
Summary of DP Algorithms
| Problem | Bellman equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
- Algorithms are based on the state-value function $\operatorname{v}_\pi(s)$ or $\operatorname{v}_*(s)$
- With $m$ actions and $n$ states, each iteration sums over all $m$ actions and all $n$ successor states for every state, so the complexity is $O(mn^2)$ per iteration