Section 2
2.1
Let the two actions be $a_1, a_2$ with $Q_t(a_1) > Q_t(a_2)$, i.e. $a_1$ is the greedy action. With $\varepsilon = 0.5$, $a_1$ can be selected either by exploitation or by the exploratory uniform draw:
$$P(A_t = a_1) = 1\cdot(1-\varepsilon) + 1\cdot\varepsilon\cdot 0.5 = 0.75$$
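As a quick sanity check, here is a minimal Monte Carlo sketch of this setup (my own code, not part of the exercise); it estimates $P(A_t = a_1)$ empirically:

```python
import random

# With epsilon = 0.5 and two actions, the greedy action a_1 is chosen with
# probability (1 - eps) + eps * 0.5 = 0.75; estimate this by simulation.
eps, trials = 0.5, 100_000
hits = 0
for _ in range(trials):
    if random.random() < eps:
        action = random.choice(["a1", "a2"])  # explore: uniform over both arms
    else:
        action = "a1"                         # exploit: a_1 is greedy
    hits += action == "a1"
print(hits / trials)  # ~0.75
```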
2.2
$A_5$ must be a time step on which the $\varepsilon$ case occurred; all the remaining time steps could possibly have been exploratory.
2.3
Quantitatively, the cumulative reward is the integral of the average reward. Over 1000 steps it is clearly larger for $\varepsilon = 0.1$; as time goes to infinity it is larger for $\varepsilon = 0.01$, because once the optimal action has been found, the number of wasted exploratory steps is ten times smaller than with $\varepsilon = 0.1$.
2.4
$$\begin{aligned} Q_{n+1} &= Q_n + \alpha_n(R_n - Q_n) = (1-\alpha_n)Q_n + \alpha_n R_n \\ &= (1-\alpha_n)(1-\alpha_{n-1})Q_{n-1} + \alpha_{n-1}(1-\alpha_n)R_{n-1} + \alpha_n R_n \\ &= (1-\alpha_n)\cdots(1-\alpha_1)Q_1 + \alpha_1(1-\alpha_n)\cdots(1-\alpha_2)R_1 + \cdots + \alpha_n R_n \\ &= \prod_{i=1}^{n}(1-\alpha_i)\,Q_1 + \sum_{k=1}^{n}\alpha_k R_k \prod_{i=k+1}^{n}(1-\alpha_i) \end{aligned}$$
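The closed form is easy to check numerically. A small sketch (my own code, with arbitrary step sizes and rewards) compares the incremental rule against the product/sum expression:

```python
import numpy as np

# Run Q_{n+1} = Q_n + alpha_n (R_n - Q_n), then evaluate
# prod_i (1 - alpha_i) * Q_1 + sum_k alpha_k R_k prod_{i=k+1..n} (1 - alpha_i).
rng = np.random.default_rng(0)
n = 10
alphas = rng.uniform(0.05, 0.5, size=n)  # arbitrary step-size sequence
rewards = rng.normal(size=n)

q = q1 = 1.0                             # arbitrary initial estimate Q_1
for a, r in zip(alphas, rewards):        # incremental form
    q += a * (r - q)

closed = np.prod(1 - alphas) * q1 + sum(
    alphas[k] * rewards[k] * np.prod(1 - alphas[k + 1:]) for k in range(n)
)
print(q, closed)                         # the two values agree
```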
2.5
https://github.com/JTBBB-J/rl_learn.git
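For reference, a minimal sketch of the experiment (the full version is in the repo above; the random-walk drift scale 0.01 and $\varepsilon = 0.1$ follow the exercise):

```python
import numpy as np

# 10-armed nonstationary bandit: all true values q*(a) take independent random
# walks; compare sample-average updates with a constant step size of 0.1 under
# epsilon-greedy action selection.
rng = np.random.default_rng(0)
k, steps, eps = 10, 10_000, 0.1

for method in ("sample-average", "constant 0.1"):
    q_true = np.zeros(k)                    # all arms start equal
    q_est = np.zeros(k)
    counts = np.zeros(k)
    optimal = 0
    for t in range(steps):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(q_est))
        reward = rng.normal(q_true[a])      # unit-variance reward
        counts[a] += 1
        alpha = 1 / counts[a] if method == "sample-average" else 0.1
        q_est[a] += alpha * (reward - q_est[a])
        optimal += a == int(np.argmax(q_true))
        q_true += rng.normal(scale=0.01, size=k)  # nonstationary drift
    print(method, optimal / steps)          # constant alpha tracks drift better
```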
2.6
The spikes appear early on: once every action has been taken once, the best action has the largest estimated value, so the greedy action is executed (early on each update still moves the estimates a lot). That single execution then drives its estimated value back down, so other actions get taken next.
2.7
Not solved yet.
2.8
The first ten steps try every action once. On the eleventh step, since every $N_t(a) = 1$, the exploration bonus is equal across arms and the greedy action is necessarily taken. After that, because $c$ is fairly large, the actions with the fewest selections keep getting chosen frequently, i.e. the algorithm leans toward exploration, which makes the average reward drop sharply.
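For reference, a minimal sketch of the UCB rule this answer describes (my own code; the value of `c` and the tie-breaking are assumptions):

```python
import numpy as np

# UCB selection: A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].
# While some N_t(a) = 0 that arm is treated as maximizing, which is why the
# first k steps try every action once; with a large c the bonus term then
# dominates, so rarely-tried arms keep being picked (heavy exploration).
def ucb_action(q_est: np.ndarray, counts: np.ndarray, t: int, c: float = 2.0) -> int:
    untried = np.flatnonzero(counts == 0)
    if untried.size:                     # try each untried arm first
        return int(untried[0])
    return int(np.argmax(q_est + c * np.sqrt(np.log(t) / counts)))
```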
2.9
Just divide the numerator and denominator through by the numerator.
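Concretely, for two actions, dividing the soft-max through by $e^{H_t(a_1)}$ yields exactly the logistic (sigmoid) function:
$$\pi_t(a_1) = \frac{e^{H_t(a_1)}}{e^{H_t(a_1)} + e^{H_t(a_2)}} = \frac{1}{1 + e^{-(H_t(a_1) - H_t(a_2))}} = \sigma\bigl(H_t(a_1) - H_t(a_2)\bigr)$$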
3.6
$$G_t = \begin{cases} 0 & \text{if the pole never falls} \\ -\delta^{\,T-t-1} & \text{if failure occurs at time } T \end{cases}$$
3.7
Before it has ever escaped, the robot has no way of knowing that leaving the maze yields a reward, and staying inside the maze carries no negative reward, so it just keeps staying in the maze and never searches for the exit.
3.8
Just substitute into the return formula.
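Written out with the exercise's values ($\delta = 0.5$; $R_1 = -1$, $R_2 = 2$, $R_3 = 6$, $R_4 = 3$, $R_5 = 2$; $T = 5$), working backwards with $G_t = R_{t+1} + \delta G_{t+1}$ from $G_5 = 0$:
$$G_5 = 0,\quad G_4 = 2,\quad G_3 = 3 + 0.5\cdot 2 = 4,\quad G_2 = 6 + 0.5\cdot 4 = 8,\quad G_1 = 2 + 0.5\cdot 8 = 6,\quad G_0 = -1 + 0.5\cdot 6 = 2$$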
3.9
$$G_0 = 2 + 7\delta + 7\delta^2 + \cdots = 2 + 7\left(\frac{1}{1-\delta} - 1\right) = 2 + 7\cdot\frac{0.9}{0.1} = 65$$
$G_1$ drops the leading $2$ but is also discounted one step less: $G_1 = 7 + 7\delta + 7\delta^2 + \cdots = \frac{7}{1-\delta} = 70$, which is consistent since $G_0 = R_1 + \delta G_1 = 2 + 0.9 \times 70 = 65$.
3.10
This is just the geometric series.
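That is, the partial sums converge for $0 \le \delta < 1$:
$$\sum_{k=0}^{n}\delta^k = \frac{1-\delta^{n+1}}{1-\delta} \;\xrightarrow{\,n\to\infty\,}\; \frac{1}{1-\delta}$$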
3.11
$$\mathbb{E}[R_{t+1}] = \sum_{a}\pi(a\mid s)\sum_{r\in\mathcal{R}} r \sum_{s'\in\mathcal{S}} p(s', r\mid s, a)$$
3.12
$$\begin{aligned} v_\pi(s) &= \mathbb{E}[G_t\mid S_t=s] = \sum_g g\,P(G_t=g\mid S_t=s) \\ &= \sum_g g\sum_a P(A_t=a\mid S_t=s)\,P(G_t=g\mid S_t=s, A_t=a) \\ &= \sum_a P(A_t=a\mid S_t=s)\,\mathbb{E}[G_t\mid S_t=s, A_t=a] \\ &= \sum_a \pi(a\mid s)\,q_\pi(s,a) \end{aligned}$$
3.13
Not solved yet.
3.14
$$\begin{aligned} \sum_a\pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\bigl[r + \delta v_\pi(s')\bigr] &= 0.25(0.9\times 0.4) + 0.25(0.9\times 2.3) + 0.25(0.9\times 0.7) + 0.25(0.9\times(-0.4)) \\ &= 0.675 \end{aligned}$$
I still haven't quite figured out why the probability of getting the $+5$ reward for going from B to B′ is not counted here; can anyone explain?
3.15
$$G_t' = \sum_{k=0}^{\infty}\delta^k\bigl(R_{t+k+1} + c\bigr) = \sum_{k=0}^{\infty}\delta^k R_{t+k+1} + \sum_{k=0}^{\infty}\delta^k c = G_t + \frac{c}{1-\delta}$$
$$v_\pi'(s) = \mathbb{E}[G_t'\mid S_t=s] = \mathbb{E}[G_t\mid S_t=s] + \frac{c}{1-\delta} = v_\pi(s) + \frac{c}{1-\delta}$$
So every state value is shifted by the same constant $v_c = \frac{c}{1-\delta}$, and the relative values of the states are unchanged.
3.17
The derivation is analogous to the Bellman equation for $v_\pi$; for details see 强化学习 学习记录(2).
$$q_\pi(s,a) = \sum_{s',r} p(s',r\mid s,a)\Bigl[r + \delta\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Bigr]$$
3.18
$$v_\pi(s) = \mathbb{E}_\pi\bigl[q_\pi(S_t, A_t)\mid S_t=s\bigr] = \sum_a \pi(a\mid s)\,q_\pi(s,a)$$
3.19
$$q_\pi(s,a) = \mathbb{E}\bigl[R_{t+1} + \delta v_\pi(S_{t+1})\mid S_t=s, A_t=a\bigr] = \sum_{s',r} p(s',r\mid s,a)\bigl(r + \delta v_\pi(s')\bigr)$$
4.1
$$q_\pi(11,\text{down}) = \sum_{s',r} p(s',r\mid s,a)\bigl[r + \delta v_\pi(s')\bigr] = 1\times(-1 + 0) = -1$$
$q_\pi(7,\text{down})$ works the same way: $q_\pi(7,\text{down}) = 1\times\bigl(-1 + v_\pi(11)\bigr) = -1 + (-14) = -15$.
4.2
If $v_\pi(13)$ is unchanged and remains $-20$: from state 15 the actions left, up, right, down lead to states 12, 13, 14, 15 respectively, so
$$v_\pi(15) = \tfrac{1}{4}\bigl(-1 + v_\pi(12)\bigr) + \tfrac{1}{4}\bigl(-1 + v_\pi(13)\bigr) + \tfrac{1}{4}\bigl(-1 + v_\pi(14)\bigr) + \tfrac{1}{4}\bigl(-1 + v_\pi(15)\bigr)$$
Substituting $v_\pi(12) = -22$, $v_\pi(13) = -20$, $v_\pi(14) = -14$ and solving gives $v_\pi(15) = -20$.
If $v_\pi(13)$ does change, a small modification of the code from 强化学习 学习记录(3) computes it; a sketch follows below.
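For the second case, a self-contained sketch (my own code, not the post's; state numbering follows Example 4.1 with state 0 standing in for both terminal corners, and `down` from 13 redirected to the new state 15):

```python
import numpy as np

# Iterative policy evaluation for the Exercise 4.2 gridworld under the
# equiprobable random policy: reward -1 on every transition, no discounting.
# next_state[s] lists the successors of s for the actions (up, down, left, right).
next_state = {
    1: (1, 5, 0, 2),    2: (2, 6, 1, 3),    3: (3, 7, 2, 3),
    4: (0, 8, 4, 5),    5: (1, 9, 4, 6),    6: (2, 10, 5, 7),
    7: (3, 11, 6, 7),   8: (4, 12, 8, 9),   9: (5, 13, 8, 10),
    10: (6, 14, 9, 11), 11: (7, 0, 10, 11), 12: (8, 12, 12, 13),
    13: (9, 15, 12, 14),                    # down from 13 now leads to 15
    14: (10, 14, 13, 0),
    15: (13, 15, 12, 14),                   # the new state below 13
}
v = np.zeros(16)                            # v[0]: terminal state, stays 0
for _ in range(1000):                       # in-place sweeps until convergence
    for s, succs in next_state.items():
        v[s] = np.mean([-1 + v[s2] for s2 in succs])
print(v[13], v[15])                         # both come out at -20
```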