Series
GFlowNet Foundation Notes (1)
GFlowNet Foundation Notes (2)
GFlowNet Foundation Notes (3)
GFlowNet Foundation Notes (4)
Expected Reward and the Reward-Maximizing Policy
Def 37. For any distribution $P_{\pi}(s)$ over terminating states, the expected reward is
$$V_{P_{\pi}}(s) = E_{P_{\pi}(S)}[R(S) \mid S \ge s] = \sum_{s' \ge s} R(s')\, P_{\pi}(s' \mid s \le s')$$
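Def 37 can be computed directly: condition $P_\pi$ on the terminating states reachable from $s$, then average the rewards. A minimal sketch, where `rewards`, `p_pi`, and `descendants` are illustrative names, not notation from the paper:

```python
def expected_reward(rewards, p_pi, descendants):
    """V_{P_pi}(s) = sum_{s' >= s} R(s') * P_pi(s' | s <= s')."""
    # Renormalize P_pi over the terminating states s' >= s.
    mass = sum(p_pi[t] for t in descendants)
    return sum(rewards[t] * p_pi[t] / mass for t in descendants)

# Toy example: three terminating states, all reachable from s.
rewards = {"a": 1.0, "b": 2.0, "c": 3.0}
p_pi    = {"a": 0.5, "b": 0.3, "c": 0.2}
v = expected_reward(rewards, p_pi, {"a", "b", "c"})
```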
Prop 26. If the distribution over terminating states is $P_T$, the expected reward is
$$\begin{aligned} V_{P_T}(s) &= \sum_{s' \ge s} R(s')\, P_T(s' \mid s \le s') \\ &= \sum_{s' \ge s} R(s')\, P(s' \rightarrow s_f \mid s \le s') \\ &= \sum_{s' \ge s} R(s')\, \frac{F(s' \rightarrow s_f)}{F(s)} \\ &= \sum_{s' \ge s} R(s')\, \frac{R(s')}{\sum_{s'' \ge s} R(s'')} \\ &= \frac{\sum_{s' \ge s} R(s')^2}{\sum_{s' \ge s} R(s')} \end{aligned}$$

Here the last two steps use the reward-matching condition $F(s' \rightarrow s_f) = R(s')$, and hence $F(s) = \sum_{s'' \ge s} R(s'')$.
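The closed form of Prop 26 is easy to check numerically: when $P_T(s') \propto R(s')$ over the terminating states above $s$, the expected reward collapses to $\sum R^2 / \sum R$. A small check with made-up rewards:

```python
# Assumed setup: the GFlowNet terminating distribution P_T(s') ∝ R(s').
rewards = [1.0, 2.0, 3.0]           # R(s') for terminating states s' >= s
z = sum(rewards)                    # F(s) = sum of reachable rewards
p_t = [r / z for r in rewards]      # P_T(s' | s <= s')

# Expected reward computed directly from the definition...
v_direct = sum(r * p for r, p in zip(rewards, p_t))
# ...and from the closed form sum R^2 / sum R.
v_closed = sum(r * r for r in rewards) / z

assert abs(v_direct - v_closed) < 1e-12
```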
Prop 27. Let $P_{\pi}$ be the terminating state distribution of policy $\pi$, and let $\bar\pi$ be the greedy policy with respect to it, i.e.

$$\bar\pi(a \mid s) = 0 \ \text{unless} \ V_{P_{\pi}}((s, a)) \ge V_{P_{\pi}}((s, a')) \ \forall a'$$
Then, for all $s$,

$$V_{P_{\bar\pi}}(s) \ge V_{P_{\pi}}(s)$$
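Prop 27 can be illustrated on the simplest possible case: a depth-one tree where each action from $s_0$ leads directly to a terminating state. There the claim reduces to $\max_{s'} R(s') \ge \sum R^2 / \sum R$. A hypothetical toy check, not code from the paper:

```python
# Depth-one tree: action a from s0 terminates with reward R(a).
# Under the GFlowNet policy, P_T(a) ∝ R(a); the greedy policy
# puts all its mass on the highest-value action.
rewards = [1.0, 2.0, 3.0]
z = sum(rewards)

v_pi  = sum(r * (r / z) for r in rewards)  # V_{P_pi}(s0) = sum R^2 / sum R
v_bar = max(rewards)                       # greedy: all mass on argmax R

assert v_bar >= v_pi  # Prop 27: greedification never decreases the value
```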