Understanding the state value function and the state-action value function in reinforcement learning (RL)

Daniel_Smith

Last modified 2022-11-14 15:10:19


Tags: reinforcement learning, deep learning, rl, dl, ml

First published 2022-11-14 14:54:23

@Copyright DanielSmith

Article link: https://blog.csdn.net/qq_41725313/article/details/127846775


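Two basic concepts come up constantly in RL:
the state value (V) and the action value (Q),
or, named after what they take as arguments:
V(s) and Q(s, a).
This post distinguishes the two and explains how they relate: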

state value function:
It can be described as:
It is the expected return (cumulative reward) when starting from state s and following policy π.

The sum of rewards weighted by the discount factor γ can be written as the cumulative return G:
γ is the discount factor; it determines how strongly future rewards count toward the return.
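In standard textbook notation (an assumption here, since the post's own formula is not reproduced), with R_{t+1} denoting the reward received after step t, the return and the state value can be written as:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
$$

With γ close to 0 only the nearest rewards matter; with γ close to 1 distant rewards count almost as much as immediate ones.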

The expectation of this return is the value of v(s).
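As a rough illustration, the discounted return and a simple Monte Carlo estimate of v(s) might be computed as follows. This is a minimal sketch: the function names `discounted_return` and `monte_carlo_value` and the reward sequences are made up for the example and are not from the original post.

```python
def discounted_return(rewards, gamma):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Estimate V(s) as the average return over episodes that all start in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences, each assumed to be collected from the same
# start state s while following the same policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 2.0]]
print(monte_carlo_value(episodes, gamma=0.5))
```

Averaging sampled returns is only one way to approximate the expectation; it assumes every episode starts in the same state and follows the same policy.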
action value function:
It is the expected return (cumulative reward) when starting from state s, taking action a, and then following policy π.

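The key difference is that the q function measures the value of the possible future return based not only on the current state but also on a specific action taken in that state.
In the same way, the sum can be written as: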
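In standard notation (again an assumption about what the missing formula showed), the action value conditions the same discounted sum of rewards on both the state and the first action:

$$
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s,\ A_t = a \right]
$$

Only the first action a is fixed; every later action is drawn from π.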
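At this point it is natural to ask whether the q function and the v function are related.
They are, and the relationship can be written in two directions:
a. Expressing q in terms of v: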

P is the state-transition matrix, which gives the probability of reaching the next state s' from state s.
R is the immediate reward, and V is the state value of the next state s'.
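Using the symbols just described, and assuming the transition probability also depends on the chosen action (written P(s' | s, a) here, a notational choice not taken from the post), the relation can be sketched as:

$$
Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')
$$

In words: the value of taking a in s is the immediate reward plus the discounted value of wherever the environment lands next, averaged over the possible next states.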

b. Expressing v in terms of q:

The value function is an aggregate: the sum, over all actions, of the probability that the policy chooses each action multiplied by the action value of taking that action.
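Writing π(a | s) for the probability that the policy picks action a in state s (standard notation, assumed here), this weighted sum is:

$$
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)
$$

For a deterministic policy the sum collapses to a single term, so the state value equals the action value of the one action the policy selects in s.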

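Finally, the following figure helps in understanding the relationship between the two: [figure: relationship between the v function and the q function]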
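The two directions of the relationship (q from v, and v from q) can also be checked numerically. The sketch below uses a hypothetical MDP with two states and two actions; every number in `P`, `R` and `PI`, and the helper names `q_from_v`, `v_from_q` and `evaluate_policy`, are invented for illustration.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a][s2]: probability of reaching s2 from s after taking action a.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# R[s][a]: immediate expected reward for taking action a in state s.
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
# PI[s][a]: probability that the policy picks action a in state s.
PI = [
    [0.5, 0.5],
    [0.3, 0.7],
]


def q_from_v(v):
    """Q(s, a) = R(s, a) + gamma * sum over s2 of P(s2 | s, a) * V(s2)."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
         for a in ACTIONS]
        for s in STATES
    ]


def v_from_q(q):
    """V(s) = sum over a of pi(a | s) * Q(s, a)."""
    return [sum(PI[s][a] * q[s][a] for a in ACTIONS) for s in STATES]


def evaluate_policy(sweeps=1000):
    """Iterative policy evaluation: apply both relations until V settles."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = v_from_q(q_from_v(v))
    return v


v_pi = evaluate_policy()
q_pi = q_from_v(v_pi)
print("V:", v_pi)
print("Q:", q_pi)
```

Once `v_pi` has converged, feeding `q_pi` back through `v_from_q` reproduces `v_pi`, which is exactly the consistency between the two functions described above.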
There are of course other ways to interpret this, all of them fairly accurate:

As for applying the advantage function, the following work is an example:
Dueling Network Architectures for Deep Reinforcement Learning
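The advantage function is usually defined as A(s, a) = Q(s, a) - V(s), that is, how much better action a is than the policy's average behavior in s. The dueling architecture estimates a state value and per-action advantages in separate streams and combines them as Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')). The sketch below shows only that aggregation step on plain numbers, not the neural network; the function names and inputs are hypothetical.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value and per-action advantages into Q-values.

    Uses the mean-subtracted aggregation:
    Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


def advantage(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum over a of pi(a | s) * Q(s, a)."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]


# Made-up outputs of the two streams for a single state with three actions.
q = dueling_q(value=2.0, advantages=[1.0, 0.0, -1.0])
print(q)
print(advantage(q, policy_probs=[0.5, 0.25, 0.25]))
```

Subtracting the mean advantage keeps the split between V and A identifiable; without it, a constant could be moved freely between the two streams without changing Q.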
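Another interpretation:

These are essentially consistent statements: the q function puts more emphasis on characterizing the action, and for this reason it is arguably better suited to cases where the action space is large or state-action pairs are hard to collect!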

respect！
