Bellman equation in RL

This post takes a close look at a foundational concept in reinforcement learning, the Bellman equation, with particular attention to the expectation taken with respect to the policy $\pi$. Starting from the state transition probability and the reward distribution, it states the definitions of the value function $V_\pi(s)$ and the action-value function $q_\pi(s,a)$, and then derives their expressions using the law of total expectation and the state transition diagram.

Bellman equation

The Bellman expectation equation is a very basic and important concept in reinforcement learning, but some of its details are easy to misunderstand, especially the part involving $\mathbb{E}_{\pi}$, the expectation taken with respect to the policy $\pi$. After working through the derivations in the articles "Understanding RL: The Bellman Equations" and "Derivation of Bellman's Equation", I have collected the reasoning behind the Bellman equations here.

Definition

Given a finite set of states $S$ and actions $A$, the state transition probability is $p(s' \mid s, a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$ and the reward distribution is $R(r \mid s, a, s') = \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s')$.
Notice that the reward is actually a distribution rather than a deterministic value; this is the root of some misunderstanding, because many articles only write its expectation.

Here we can combine the state transition and the reward distribution into a single joint probability $p(s', r \mid s, a) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)$.
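To make the joint form concrete, here is a minimal sketch of a hypothetical two-state, two-action MDP (all state names, actions, and reward values below are invented for illustration). It stores $p(s', r \mid s, a)$ as a table and marginalizes it to recover $p(s' \mid s, a)$ and the expected reward that many texts write instead:

```python
# Hypothetical dynamics p(s', r | s, a): a dict mapping (s, a) -> {(s', r): probability}.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 0.0): 0.9, ("s1", 5.0): 0.1},  # same s', two possible rewards
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): sum the joint distribution over all rewards leading to s'."""
    return sum(pr for (sp, _), pr in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """E[R_{t+1} | S_t=s, A_t=a]: marginalize the joint distribution over (s', r)."""
    return sum(r * pr for (_, r), pr in p[(s, a)].items())

# Each p(., . | s, a) must be a proper probability distribution.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(transition_prob("s0", "a1", "s1"))  # 1.0
print(expected_reward("s0", "a1"))        # 0.5
```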

A policy $\pi(a \mid s)$ maps each state $s \in S$ to a probability distribution over the actions $a \in A$, i.e. the probability of taking action $a$ in state $s$.

The value function is defined as
$$V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$
and the action-value function as
$$q_\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right].$$
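One step worth making explicit before the derivation (this is the standard treatment, restated here for completeness): the discounted return $G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ satisfies a one-step recursion, so conditioning on the next state and reward splits the expectation into an immediate reward plus a discounted value term:
$$G_t = R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} = R_{t+1} + \gamma G_{t+1}$$
$$q_\pi(s,a) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \,\middle|\, S_t = s, A_t = a\right], \qquad \mathbb{E}_\pi\!\left[G_{t+1} \,\middle|\, S_{t+1} = s'\right] = V_\pi(s').$$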

According to the well-known law of total expectation and the state transition diagram below:
[Figure: state transition diagram from $(s, a)$ to the possible next states $s'$ and rewards $r$]

$$q_\pi(s,a) = \sum_{s', r} \big[r + \gamma V_\pi(s')\big]\, p(s', r \mid s, a)$$

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} \big[r + \gamma V_\pi(s')\big]\, p(s', r \mid s, a) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$$
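As a sanity check of these two formulas, here is a minimal iterative policy-evaluation sketch in Python. The two-state MDP, its dynamics $p(s', r \mid s, a)$, and the uniform policy $\pi(a \mid s)$ are hypothetical toy values chosen only for illustration; the update itself is exactly the pair of equations above, applied repeatedly until $V_\pi$ stops changing.

```python
# Hypothetical toy MDP and policy, for illustration only.
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

p = {  # (s, a) -> {(s', r): probability}, i.e. p(s', r | s, a)
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 0.0): 0.9, ("s1", 5.0): 0.1},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}
pi = {("s0", "a0"): 0.5, ("s0", "a1"): 0.5,   # pi(a | s), uniform here
      ("s1", "a0"): 0.5, ("s1", "a1"): 0.5}

V = {s: 0.0 for s in states}
while True:
    # q(s,a) = sum_{s',r} [r + gamma * V(s')] p(s', r | s, a)
    q = {(s, a): sum((r + gamma * V[sp]) * pr
                     for (sp, r), pr in p[(s, a)].items())
         for s in states for a in actions}
    # V(s) = sum_a pi(a|s) q(s,a)
    V_new = {s: sum(pi[(s, a)] * q[(s, a)] for a in actions) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new

print({s: round(V[s], 3) for s in states})
print({k: round(v, 3) for k, v in q.items()})
```

Because each sweep is a $\gamma$-contraction for $\gamma < 1$, the loop converges to the unique fixed point of the Bellman expectation equation for this policy.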
