【Value iteration algorithm】
$$v_{k+1}=f\left(v_k\right)=\max _\pi\left(r_\pi+\gamma P_\pi v_k\right), \quad k=1,2,3,\ldots$$
- Step 1 (policy update): given $v_k$, solve for $\pi_{k+1}$:
  $$\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_k\right)$$
- Step 2 (value update): use the new $\pi_{k+1}$ to compute $v_{k+1}$:
  $$v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$$
Question: is $v_k$ a state value?

Answer: no. A state value must satisfy the Bellman equation $v_\pi=r_\pi+\gamma P_\pi v_\pi$, whereas here $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$ has $v_k$ rather than $v_{k+1}$ on the right-hand side; $v_k$ is only an intermediate quantity that the iteration drives toward the optimal state value.
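Why the iteration converges, in brief (this is the contraction-mapping argument used for the Bellman optimality equation, recalled here rather than taken from the notes): $f(v)=\max_\pi(r_\pi+\gamma P_\pi v)$ is a $\gamma$-contraction in the max-norm and $v^*=f(v^*)$ is its unique fixed point, so

$$\|v_{k+1}-v^*\|_\infty=\|f(v_k)-f(v^*)\|_\infty\le\gamma\,\|v_k-v^*\|_\infty
\;\Longrightarrow\;
\|v_k-v^*\|_\infty\le\gamma^k\,\|v_0-v^*\|_\infty\longrightarrow 0 .$$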
✨policy update:
$$\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_k\right)$$
Elementwise form:
$$\pi_{k+1}(s)=\arg \max _\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)\right)}_{q_k(s, a)}, \quad s \in \mathcal{S}$$
Let $a_k^*(s)=\arg \max _a q_k(s, a)$. The optimal solution of the maximization above is the greedy policy:
$$\pi_{k+1}(a \mid s)=\begin{cases}1, & a=a_k^*(s) \\ 0, & a \neq a_k^*(s)\end{cases}$$
Because it puts all probability on the action with the largest q-value, $\pi_{k+1}$ is called a greedy policy.
✨value update:
$$v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$$
Elementwise form:
$$v_{k+1}(s)=\sum_a \pi_{k+1}(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)\right)}_{q_k(s, a)}, \quad s \in \mathcal{S}$$
Since $\pi_{k+1}$ is greedy, the value update simplifies to $v_{k+1}(s)=\max _a q_k(s, a)$.
✨Value iteration algorithm pseudocode:
$$v_k(s) \;\rightarrow\; q_k(s, a) \;\rightarrow\; \text{greedy policy } \pi_{k+1}(a \mid s) \;\rightarrow\; \text{new value } v_{k+1}(s)=\max _a q_k(s, a)$$
Pseudocode:
- Initialization: the model, i.e. $p(r \mid s, a)$ and $p\left(s^{\prime} \mid s, a\right)$ for all $(s,a)$, is known; an initial guess $v_0$ is given.
- Goal: find the optimal state value and an optimal policy solving the Bellman optimality equation.
- Procedure: while $\left\|v_k-v_{k-1}\right\|$ is above a given threshold, for the $k$-th iteration:
  - For every state $s \in \mathcal{S}$:
    - For every action $a \in \mathcal{A}(s)$, compute the q-value: $q_k(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)$
    - Pick the action with the largest q-value: $a_k^*(s)=\arg \max _a q_k(s, a)$
    - Policy update: $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise
    - Value update: $v_{k+1}(s)=\max _a q_k(s, a)$

In words: for each state we compute the q-value of every action; picking the action with the largest q-value tells us how to act (policy update), and that maximum q-value becomes the new value of the state (value update). Repeating this until the values stop changing yields the optimal solution; a code sketch follows below.
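A minimal code sketch of the procedure above, assuming the model is stored as NumPy arrays; the layout `P[s, a, s']` / `R[s, a]` and the function name are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration; a sketch of the pseudocode above.

    Assumed inputs (hypothetical layout):
      P: (S, A, S) array with P[s, a, s'] = p(s' | s, a)
      R: (S, A) array with R[s, a] = sum_r p(r | s, a) * r (expected reward)
    Returns the converged value estimate and the greedy deterministic policy.
    """
    S, A = R.shape
    v = np.zeros(S)                          # initial guess v_0
    while True:
        q = R + gamma * P @ v                # q_k(s, a), shape (S, A)
        v_new = q.max(axis=1)                # value update: v_{k+1}(s) = max_a q_k(s, a)
        policy = q.argmax(axis=1)            # policy update: greedy action a_k^*(s)
        if np.max(np.abs(v_new - v)) < tol:  # stop when ||v_{k+1} - v_k|| is small
            return v_new, policy
        v = v_new
```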
✨Example:
$r_{\text{boundary}}=r_{\text{forbidden}}=-1$, $r_{\text{target}}=1$, $\gamma=0.9$
【Policy iteration algorithm】
$$\pi_0 \stackrel{PE}{\longrightarrow} v_{\pi_0} \stackrel{PI}{\longrightarrow} \pi_1 \stackrel{PE}{\longrightarrow} v_{\pi_1} \stackrel{PI}{\longrightarrow} \pi_2 \stackrel{PE}{\longrightarrow} v_{\pi_2} \stackrel{PI}{\longrightarrow} \cdots$$
- Initialization: start from an arbitrary policy $\pi_0$.
- Step 1 (policy evaluation): solve the Bellman equation to obtain the state value of the current policy, which tells us how good it is:
  $$v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}$$
- Step 2 (policy improvement): improve the policy to $\pi_{k+1}$:
  $$\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$$
✨policy evaluation:
$$v_{\pi_k}^{(j+1)}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j=0,1,2, \ldots$$
Elementwise form:
$$v_{\pi_k}^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}^{(j)}\left(s^{\prime}\right)\right), \quad s \in \mathcal{S}$$
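A minimal sketch of this inner iteration, under the same assumed `P[s, a, s']` / `R[s, a]` layout as in the value-iteration sketch, with the policy given as a matrix `pi[s, a] = π(a|s)` (again an illustrative convention, not from the original notes):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iteratively solve v_pi = r_pi + gamma * P_pi v_pi for a fixed policy pi."""
    S = R.shape[0]
    r_pi = (pi * R).sum(axis=1)              # r_pi(s)    = sum_a pi(a|s) R[s, a]
    P_pi = np.einsum('sa,sat->st', pi, P)    # P_pi[s,s'] = sum_a pi(a|s) P[s, a, s']
    v = np.zeros(S)                          # initial guess v^(0)
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v      # v^(j+1) = r_pi + gamma * P_pi v^(j)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new
```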
✨policy improvement:
$$\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$$
Elementwise form:
$$\pi_{k+1}(s)=\arg \max _\pi \sum_a \pi(a \mid s) \underbrace{\left(\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}\left(s^{\prime}\right)\right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S}$$
Here $q_{\pi_k}(s, a)$ is the action value under policy $\pi_k$. Let $a_k^*(s)=\arg \max _a q_{\pi_k}(s, a)$; the corresponding greedy policy is:
$$\pi_{k+1}(a \mid s)=\begin{cases}1, & a=a_k^*(s) \\ 0, & a \neq a_k^*(s)\end{cases}$$
✨Policy iteration algorithm pseudocode:
- Initialization: the model, i.e. $p(r \mid s, a)$ and $p\left(s^{\prime} \mid s, a\right)$ for all $(s,a)$, is known; an initial policy $\pi_0$ is given.
- Goal: find the optimal state value and an optimal policy.
- While the policy has not converged, for the $k$-th iteration:
  - Policy evaluation:
    - Pick an arbitrary initial guess $v_{\pi_k}^{(0)}$.
    - While $v_{\pi_k}^{(j)}$ has not converged, for every state $s \in \mathcal{S}$: $v_{\pi_k}^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}^{(j)}\left(s^{\prime}\right)\right]$
  - Policy improvement:
    - For every state $s \in \mathcal{S}$:
      - For every action $a \in \mathcal{A}(s)$: $q_{\pi_k}(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi_k}\left(s^{\prime}\right)$
      - $a_k^*(s)=\arg \max _a q_{\pi_k}(s, a)$
      - $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise

A code sketch of the full loop follows below.
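A minimal code sketch of the full policy-iteration loop, under the same assumed `P[s, a, s']` / `R[s, a]` model layout (function name and stopping rules are my own choices):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8, max_eval_iter=10_000):
    """Tabular policy iteration; a sketch of the pseudocode above."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # pi_0: deterministic, always action 0
    while True:
        # Policy evaluation: iterate v^(j+1) = r_pi + gamma * P_pi v^(j) to convergence.
        r_pi = R[np.arange(S), policy]           # r_pi(s) under the deterministic policy
        P_pi = P[np.arange(S), policy]           # P_pi[s, s'], shape (S, S)
        v = np.zeros(S)
        for _ in range(max_eval_iter):
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                break
            v = v_new
        v = v_new
        # Policy improvement: act greedily with respect to q_{pi_k}(s, a).
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy has converged
            return v, policy
        policy = new_policy
```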
✨Example:
$r_{\text{boundary}}=-1$, $r_{\text{target}}=1$, $\gamma=0.9$
- Actions: $a_{\ell}, a_0, a_r$ denote moving left, staying put, and moving right, respectively.
- Goal: find the optimal policy.
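The grid figure for this example is not reproduced in these notes, so the following is an illustration only: assume a 1D world with two cells, the right cell being the target area, where staying in the target yields $r_{\text{target}}=1$, bumping into the boundary yields $r_{\text{boundary}}=-1$, and every other move yields 0. Under that assumption the model can be written down and fed to the `policy_iteration` sketch above:

```python
import numpy as np

# Hypothetical two-cell world (the original figure is not reproduced here):
# states s0, s1 with s1 the target; actions 0/1/2 = a_left / a_stay / a_right.
S, A = 2, 3
P = np.zeros((S, A, S))   # P[s, a, s'] = p(s' | s, a)
R = np.zeros((S, A))      # expected immediate reward r(s, a)

# From s0: left hits the boundary (stay in s0, r=-1), stay gives 0, right enters the target (r=1).
P[0, 0, 0] = 1.0; R[0, 0] = -1.0
P[0, 1, 0] = 1.0; R[0, 1] = 0.0
P[0, 2, 1] = 1.0; R[0, 2] = 1.0
# From s1: left leaves the target (r=0), stay keeps collecting r=1, right hits the boundary (r=-1).
P[1, 0, 0] = 1.0; R[1, 0] = 0.0
P[1, 1, 1] = 1.0; R[1, 1] = 1.0
P[1, 2, 1] = 1.0; R[1, 2] = -1.0

v, policy = policy_iteration(P, R, gamma=0.9)   # sketch defined above
print(v)        # state values of the converged policy
print(policy)   # expected result: move right in s0 (action 2), stay in s1 (action 1)
```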
【Truncated policy iteration algorithm】
✨Comparison of value iteration and policy iteration:
Policy iteration:
$$\pi_0 \stackrel{PE}{\longrightarrow} v_{\pi_0} \stackrel{PI}{\longrightarrow} \pi_1 \stackrel{PE}{\longrightarrow} v_{\pi_1} \stackrel{PI}{\longrightarrow} \pi_2 \stackrel{PE}{\longrightarrow} v_{\pi_2} \stackrel{PI}{\longrightarrow} \cdots$$
- Policy evaluation (PE): $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k} v_{\pi_k}$
- Policy improvement (PI): $\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_{\pi_k}\right)$
Value iteration:
$$u_0 \stackrel{PU}{\longrightarrow} \pi_1^{\prime} \stackrel{VU}{\longrightarrow} u_1 \stackrel{PU}{\longrightarrow} \pi_2^{\prime} \stackrel{VU}{\longrightarrow} u_2 \stackrel{PU}{\longrightarrow} \cdots$$
- Policy update (PU): $\pi_{k+1}=\arg \max _\pi\left(r_\pi+\gamma P_\pi v_k\right)$
- Value update (VU): $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}} v_k$
- Written out step by step, the first three steps of the two algorithms are identical.
- They differ at the fourth step:
  - Policy iteration solves $v_{\pi_1}=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}$, which requires an inner iteration (the iterative algorithm for the Bellman equation):
    $$\begin{aligned} v_{\pi_1}^{(0)}&=v_0 \\ v_{\pi_1}^{(1)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(0)} \\ v_{\pi_1}^{(2)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(1)} \\ &\;\vdots \\ v_{\pi_1}^{(j)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(j-1)} \\ &\;\vdots \\ v_{\pi_1}^{(\infty)}&=r_{\pi_1}+\gamma P_{\pi_1} v_{\pi_1}^{(\infty)} \end{aligned}$$
  - Value iteration computes $v_1=r_{\pi_1}+\gamma P_{\pi_1} v_0$, which is just a single step (one plain computation).
- Truncated policy iteration sits between the two: it runs the inner evaluation for only a finite number $j_{\text{truncate}}$ of steps, so value iteration ($j_{\text{truncate}}=1$) and policy iteration ($j_{\text{truncate}} \to \infty$) are its two extreme cases.
✨Truncated policy iteration pseudocode:
- Initialization: the model, i.e. $p(r \mid s, a)$ and $p\left(s^{\prime} \mid s, a\right)$ for all $(s,a)$, is known; an initial policy $\pi_0$ is given.
- Goal: find the optimal state value and an optimal policy.
- While the policy has not converged, for the $k$-th iteration:
  - Policy evaluation:
    - Initialization: $v_k^{(0)}=v_{k-1}$; the maximum number of inner iterations is $j_{\text{truncate}}$.
    - While $j < j_{\text{truncate}}$:
      - For every state $s \in \mathcal{S}$: $v_k^{(j+1)}(s)=\sum_a \pi_k(a \mid s)\left[\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k^{(j)}\left(s^{\prime}\right)\right]$
    - Set $v_k=v_k^{\left(j_{\text{truncate}}\right)}$.
  - Policy improvement:
    - For every state $s \in \mathcal{S}$:
      - For every action $a \in \mathcal{A}(s)$: $q_k(s, a)=\sum_r p(r \mid s, a)\, r+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_k\left(s^{\prime}\right)$
      - $a_k^*(s)=\arg \max _a q_k(s, a)$
      - $\pi_{k+1}(a \mid s)=1$ if $a=a_k^*(s)$, and $\pi_{k+1}(a \mid s)=0$ otherwise

A code sketch follows below.
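A minimal code sketch of truncated policy iteration under the same assumed model layout; `j_truncate = 1` reduces it to value iteration, while a very large `j_truncate` behaves like policy iteration (the stopping rule on the value change is a simple heuristic of mine, not from the original notes):

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=10, tol=1e-6):
    """Truncated policy iteration; a sketch of the pseudocode above."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)       # pi_0: deterministic, always action 0
    v = np.zeros(S)                       # v_0
    while True:
        v_prev = v.copy()
        # Truncated policy evaluation: only j_truncate sweeps, starting from the previous v.
        r_pi = R[np.arange(S), policy]
        P_pi = P[np.arange(S), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v   # v^(j+1) = r_pi + gamma * P_pi v^(j)
        # Policy improvement: greedy with respect to q_k(s, a) built from v_k.
        q = R + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_prev)) < tol:   # heuristic stop: value estimate has settled
            return v, policy
```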
The figure accompanying the original notes (not reproduced here) compares how quickly the three algorithms converge.