Bellman equation in RL

This post takes a close look at a foundational concept in reinforcement learning, the Bellman equation, with particular attention to the expectation taken with respect to the policy $\pi$. Starting from the state transition probability and the reward distribution, it states the definitions of the value function $V_\pi(s)$ and the action-value function $q_\pi(s,a)$, and then derives their expressions using the law of total expectation and the state transition diagram.

Bellman equation

The Bellman expectation equation is a very basic and important concept in reinforcement learning, but some of its details are easy to misunderstand, especially the part involving $\mathbb{E}_{\pi}$, the expectation taken with respect to the policy $\pi$. After working through the derivations in the articles "Understanding RL: The Bellman Equations" and "Derivation of Bellman's Equation", I have collected the reasoning behind the Bellman equations here.

Definition

Given a finite set of states $S$ and actions $A$, the state transition probability is $p(s' \mid s, a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$ and the reward distribution is $R(r \mid s, a, s') = \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s')$.
Notice that the reward is actually a distribution rather than a deterministic value; this is the root of some misunderstanding, because many articles only write its expectation.

Here we can combine the state transition and the reward distribution into a single joint probability $p(s', r \mid s, a) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a)$.
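To make the joint form concrete, here is a minimal sketch of a hypothetical two-state, two-action MDP (all state names, actions, and reward values below are invented for illustration). It stores $p(s', r \mid s, a)$ as a table and marginalizes it to recover $p(s' \mid s, a)$ and the expected reward that many texts write instead:

```python
# Hypothetical dynamics p(s', r | s, a): a dict mapping (s, a) -> {(s', r): probability}.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 0.0): 0.9, ("s1", 5.0): 0.1},  # same s', two possible rewards
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): sum the joint distribution over all rewards leading to s'."""
    return sum(pr for (sp, _), pr in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """E[R_{t+1} | S_t=s, A_t=a]: marginalize the joint distribution over (s', r)."""
    return sum(r * pr for (_, r), pr in p[(s, a)].items())

# Each p(., . | s, a) must be a proper probability distribution.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(transition_prob("s0", "a1", "s1"))  # 1.0
print(expected_reward("s0", "a1"))        # 0.5
```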

A policy $\pi(a \mid s)$ maps each state $s \in S$ to a probability distribution over the actions $a \in A$, i.e. the probability of taking action $a$ in state $s$.

The value function is defined as
$$V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$
and the action-value function as
$$q_\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right].$$
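One step worth making explicit before the derivation (this is the standard treatment, restated here for completeness): the discounted return $G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ satisfies a one-step recursion, so conditioning on the next state and reward splits the expectation into an immediate reward plus a discounted value term:
$$G_t = R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} = R_{t+1} + \gamma G_{t+1}$$
$$q_\pi(s,a) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \,\middle|\, S_t = s, A_t = a\right], \qquad \mathbb{E}_\pi\!\left[G_{t+1} \,\middle|\, S_{t+1} = s'\right] = V_\pi(s').$$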

According to the well-known law of total expectation and the state transition diagram below:
[Figure: state transition diagram from $(s, a)$ to the possible next states $s'$ and rewards $r$]

$$q_\pi(s,a) = \sum_{s', r} \big[r + \gamma V_\pi(s')\big]\, p(s', r \mid s, a)$$

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} \big[r + \gamma V_\pi(s')\big]\, p(s', r \mid s, a) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$$
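As a sanity check of these two formulas, here is a minimal iterative policy-evaluation sketch in Python. The two-state MDP, its dynamics $p(s', r \mid s, a)$, and the uniform policy $\pi(a \mid s)$ are hypothetical toy values chosen only for illustration; the update itself is exactly the pair of equations above, applied repeatedly until $V_\pi$ stops changing.

```python
# Hypothetical toy MDP and policy, for illustration only.
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

p = {  # (s, a) -> {(s', r): probability}, i.e. p(s', r | s, a)
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 0.0): 0.9, ("s1", 5.0): 0.1},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 1.0},
}
pi = {("s0", "a0"): 0.5, ("s0", "a1"): 0.5,   # pi(a | s), uniform here
      ("s1", "a0"): 0.5, ("s1", "a1"): 0.5}

V = {s: 0.0 for s in states}
while True:
    # q(s,a) = sum_{s',r} [r + gamma * V(s')] p(s', r | s, a)
    q = {(s, a): sum((r + gamma * V[sp]) * pr
                     for (sp, r), pr in p[(s, a)].items())
         for s in states for a in actions}
    # V(s) = sum_a pi(a|s) q(s,a)
    V_new = {s: sum(pi[(s, a)] * q[(s, a)] for a in actions) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new

print({s: round(V[s], 3) for s in states})
print({k: round(v, 3) for k, v in q.items()})
```

Because each sweep is a $\gamma$-contraction for $\gamma < 1$, the loop converges to the unique fixed point of the Bellman expectation equation for this policy.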
