Bellman equation
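The Bellman expectation equation is one of the most basic and important concepts in reinforcement learning, yet some of its details are hard to grasp, especially the part involving $\mathbb{E}_{\pi}$ (the expectation with respect to the policy $\pi$). After working through the derivations in the two articles Understanding RL: The Bellman Equations and Derivation of Bellman's Equation, I have written up the derivation of the Bellman equation here.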
Definition
Given a finite set of states ($S$) and actions ($A$), the state transition probability is

$$p(s' \mid s, a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$$

and the reward distribution is

$$R(r \mid s, a, s') = \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s')$$
Notice that the reward is actually a distribution rather than a deterministic value. This is the root of some misunderstandings, because some articles write only its expectation.
Here we can combine the state transition and the reward distribution into a single joint probability

$$p(s', r \mid s, a) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)$$
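The joint form carries the same information as the two separate distributions. As a minimal sketch, assuming a hypothetical toy MDP stored as a Python dictionary (the state names, action names, and helper functions below are made up for illustration), the transition probability, the reward distribution, and the expected reward can each be recovered from $p(s', r \mid s, a)$:

```python
from collections import defaultdict

# Hypothetical toy MDP: p[(s, a)] maps (next_state, reward) -> p(s', r | s, a).
p = {
    ("s0", "stay"): {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s0", "go"): {("s1", 1.0): 0.6, ("s1", 5.0): 0.2, ("s0", 0.0): 0.2},
}


def transition_prob(p, s, a):
    """Sum the reward out: p(s'|s,a) = sum_r p(s',r|s,a)."""
    out = defaultdict(float)
    for (s_next, _reward), prob in p[(s, a)].items():
        out[s_next] += prob
    return dict(out)


def reward_dist(p, s, a, s_next):
    """Condition on the next state: R(r|s,a,s') = p(s',r|s,a) / p(s'|s,a)."""
    denom = transition_prob(p, s, a)[s_next]
    return {
        r: prob / denom
        for (sn, r), prob in p[(s, a)].items()
        if sn == s_next
    }


def expected_reward(p, s, a):
    """Expected immediate reward: sum_{s',r} r * p(s',r|s,a)."""
    return sum(r * prob for (_sn, r), prob in p[(s, a)].items())
```

For the `("s0", "go")` entry, reaching `"s1"` can pay either 1 or 5, which is exactly the case where writing only the expected reward hides the underlying distribution.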
A policy is a mapping from states in $S$ to distributions over actions in $A$, written $\pi(a \mid s)$: the probability of choosing action $a$ in state $s$.
The value function is

$$V_\pi(s) = \mathbb{E}_\pi\left\{\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t=s\right\}$$

and the action value function is

$$q_\pi(s,a) = \mathbb{E}_\pi\left\{\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t=s, A_t=a\right\}$$

where $\gamma$ is the discount factor. The sum starts at $k=0$ so that the immediate reward $R_{t+1}$ is counted with weight $\gamma^{0}=1$.
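To make $\mathbb{E}_\pi$ concrete, the sketch below estimates $V_\pi(s)$ directly from the definition: it samples many trajectories in which every action is drawn from $\pi$ and every next state and reward is drawn from $p(s', r \mid s, a)$, then averages the discounted returns, counting the immediate reward with weight 1. The one-state MDP, the action names, and the function names are assumptions chosen so the result is easy to check by hand, and the infinite sum is truncated at a fixed horizon.

```python
import random

GAMMA = 0.5

# Hypothetical one-state MDP: P[(s, a)] lists ((next_state, reward), probability).
P = {
    ("s", "safe"): [(("s", 1.0), 1.0)],
    ("s", "risky"): [(("s", 0.0), 0.5), (("s", 2.0), 0.5)],
}
# PI[s] maps each action to pi(a|s).
PI = {"s": {"safe": 0.5, "risky": 0.5}}


def sample(rng, weighted):
    """Draw one outcome from a list of (outcome, probability) pairs."""
    outcomes, probs = zip(*weighted)
    return rng.choices(outcomes, weights=probs, k=1)[0]


def sampled_return(rng, s, horizon=40):
    """Discounted return of one sampled trajectory, truncated at `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = sample(rng, list(PI[s].items()))  # A_t ~ pi(.|S_t)
        s, r = sample(rng, P[(s, a)])         # (S_{t+1}, R_{t+1}) ~ p(.,.|S_t, A_t)
        g += discount * r
        discount *= GAMMA
    return g


def mc_value(s, episodes=5000, seed=0):
    """Monte Carlo estimate of V_pi(s): the average of sampled returns."""
    rng = random.Random(seed)
    total = sum(sampled_return(rng, s) for _ in range(episodes))
    return total / episodes
```

In this toy example both actions have an expected reward of 1 at every step, so the exact value is $1/(1-\gamma) = 2$, and `mc_value("s")` should land close to that. The expectation averages over two sources of randomness at once: the policy's choice of action and the environment's choice of next state and reward.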
According to the well-known law of total expectation and the state transition diagram, we have
$$q_\pi(s,a) = \sum_{s',r} [r+\gamma V_\pi(s')]\, p(s',r \mid s,a)$$
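The steps behind this identity can be written out as follows. This is a sketch that assumes the Markov property and introduces the shorthand $G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$ for the return (a symbol added here for convenience), so that $G_t = R_{t+1} + \gamma G_{t+1}$:

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb{E}_\pi\{R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a\} \\
&= \sum_{s',r} \bigl[r + \gamma\, \mathbb{E}_\pi\{G_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s', R_{t+1}=r\}\bigr]\, p(s',r \mid s,a) \\
&= \sum_{s',r} \bigl[r + \gamma\, \mathbb{E}_\pi\{G_{t+1} \mid S_{t+1}=s'\}\bigr]\, p(s',r \mid s,a) \\
&= \sum_{s',r} \bigl[r + \gamma V_\pi(s')\bigr]\, p(s',r \mid s,a)
\end{aligned}
$$

The second line is the law of total expectation, conditioning on the pair $(s', r)$. The third line uses the Markov property: once $S_{t+1}=s'$ is known, the future return no longer depends on $s$, $a$, or $r$. The last line is the definition of $V_\pi$ applied at time $t+1$. Since $a$ is given, the policy plays no role in the first step; $\mathbb{E}_\pi$ only matters inside $G_{t+1}$, where all later actions are drawn from $\pi$. Averaging the result over $a \sim \pi(\cdot \mid s)$ gives the state-value form.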
$$
\begin{aligned}
V_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} [r+\gamma V_\pi(s')]\, p(s',r \mid s,a) \\
&= \sum_a \pi(a \mid s)\, q_\pi(s,a)
\end{aligned}
$$
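Because $V_\pi$ appears on both sides, the equation can be used as an update rule. The following is a minimal sketch of iterative policy evaluation, assuming a hypothetical two-state MDP and policy (all names and numbers are made up for illustration): it applies the right-hand side repeatedly until the values stop changing.

```python
GAMMA = 0.9

# Hypothetical MDP: P[(s, a)] maps (next_state, reward) -> p(s', r | s, a).
P = {
    ("A", "left"): {("A", 0.0): 1.0},
    ("A", "right"): {("B", 1.0): 0.8, ("A", 0.0): 0.2},
    ("B", "left"): {("A", 0.0): 1.0},
    ("B", "right"): {("B", 2.0): 0.5, ("B", 0.0): 0.5},
}
# Hypothetical policy: PI[s] maps each action to pi(a|s).
PI = {
    "A": {"left": 0.5, "right": 0.5},
    "B": {"left": 0.1, "right": 0.9},
}


def q_from_v(v, s, a):
    """q(s,a) = sum_{s',r} [r + gamma * V(s')] * p(s',r|s,a)."""
    return sum(
        (r + GAMMA * v[s_next]) * prob
        for (s_next, r), prob in P[(s, a)].items()
    )


def evaluate_policy(tol=1e-10, max_sweeps=10_000):
    """Repeat V(s) <- sum_a pi(a|s) * q(s,a) until the largest change is below tol."""
    v = {s: 0.0 for s in PI}
    for _ in range(max_sweeps):
        new_v = {
            s: sum(pa * q_from_v(v, s, a) for a, pa in PI[s].items())
            for s in PI
        }
        delta = max(abs(new_v[s] - v[s]) for s in PI)
        v = new_v
        if delta < tol:
            break
    return v
```

The returned values satisfy both forms of the equation up to the tolerance: for each state, `v[s]` matches the policy-weighted sum of `q_from_v(v, s, a)`. For a small MDP the same fixed point could also be obtained by solving the linear system directly; the iterative form is used here because it mirrors the equation term by term.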