2017 Fall CS294 Lecture 6: Actor-critic introduction


qiusuoxiaozi


Category: Reinforcement Learning. Tags: cs294

Article link: https://blog.csdn.net/qiusuoxiaozi/article/details/79036543


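Strangely, I could not find the video for Lecture 5. Lecture 5 appears to be a review of neural networks, though, so it does not matter: I skipped it and started directly from Lecture 6!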

Let's retrace how actor-critic comes about:

[Figure: lecture slide]

The slide above is animated in the original PPT, which cannot be shown here. The actual derivation goes as follows:
$Q^\pi(s_t,a_t)=r(s_t,a_t)+E_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[V^\pi(s_{t+1})]$
$Q^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})$
which gives:
$A^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$
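The last approximation can be sketched in a few lines of code. This is only a minimal illustration, assuming the value estimates are already available as arrays; the function name `advantage_estimate` and the numbers are made up for the example, and there is no discount factor because the formulas above do not use one.

```python
import numpy as np

def advantage_estimate(rewards, values, next_values):
    """A(s_t, a_t) ~ r(s_t, a_t) + V(s_{t+1}) - V(s_t), elementwise over a batch."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    return rewards + next_values - values

# Hypothetical numbers for three transitions (not from the lecture).
adv = advantage_estimate(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 1.0, 0.0],       # V(s_t)
    next_values=[1.0, 0.0, 0.0],  # V(s_{t+1})
)
print(adv)  # [ 1.5 -1.   2. ]
```

Note that only $V^\pi$ is needed here: once the expectation over $s_{t+1}$ is replaced by the single sampled next state, $Q^\pi$ never has to be fitted separately.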

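One thing I want to point out: after the lecture I kept treating $V^\pi(s_t)$ as some mysterious quantity. Only later, when reading the Value Functions material in Reinforcement Learning: An Introduction (Sutton, 1998), did I realize that the $V^\pi(s_t)$ above is just an ordinary value function. In the book's words: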

Informally, the value of a state $s$ under a policy $\pi$, denoted $V^\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.

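I have included a screenshot of the relevant page of the book below. It also explains $Q^\pi$. After reading that page, everything felt much clearer.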

Remember these two names:
$V^\pi(s)$ : the state-value function for policy $\pi$
$Q^\pi(s,a)$ : the action-value function for policy $\pi$

[Figure: screenshot of the book page on value functions]
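As a small worked note on how the two functions relate (these are the standard identities, written in the notation of the formulas earlier in this post, not quoted from the slides): the state value is the action value averaged over the policy's own action choice, and the advantage is the gap between the two.

$V^\pi(s_t)=E_{a_t\sim \pi(a_t|s_t)}[Q^\pi(s_t,a_t)]$
$A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$

Substituting $Q^\pi(s_t,a_t) \approx r(s_t,a_t)+V^\pi(s_{t+1})$ into the second line gives back the advantage approximation derived earlier.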

Policy evaluation (step 2 in the figure below) can be done in two ways,

[Figure: lecture slide]

both of which appeared in the earlier slides, shown below. The second one, the bootstrapped estimate, is used more often.

[Figure: lecture slide]
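To make the difference between the two kinds of regression target concrete, here is a minimal sketch. It assumes the first method is the Monte Carlo estimate (the summed rewards from step t onward along one sampled trajectory), which the text above does not name explicitly; the function names and numbers are hypothetical, and no discount factor is used, to match the formulas in this post.

```python
import numpy as np

def monte_carlo_targets(rewards):
    """y_t = sum of rewards from step t to the end of one sampled trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, accumulate, reverse back: gives the reward-to-go at every step.
    return np.cumsum(rewards[::-1])[::-1]

def bootstrapped_targets(rewards, values):
    """y_t = r_t + V(s_{t+1}), using the current value estimates.

    values[t] is the current estimate of V(s_t); the value after the
    final step is taken to be 0 (assumption: the episode terminates there).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)
    return rewards + next_values

rewards = [1.0, 0.0, 2.0]   # hypothetical trajectory
values = [2.5, 1.5, 2.0]    # hypothetical current critic outputs
print(monte_carlo_targets(rewards))           # [3. 2. 2.]
print(bootstrapped_targets(rewards, values))  # [2.5 2.  2. ]
```

The Monte Carlo target needs the whole trajectory before it can be computed, while the bootstrapped target only needs one transition plus the critic's own estimate of the next state.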

Next, actor-critic is converted to an online form. Once we do that, step 2 can only use the bootstrapped estimate mentioned earlier.

[Figure: lecture slide]
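A single online update might look like the sketch below. It is an illustration under several assumptions that are not in the post: a tabular softmax policy with logits `theta`, a tabular critic `V`, hypothetical learning rates, and an optional discount `gamma` that defaults to 1.0 so the target matches the undiscounted formulas above. The point it shows is why only the bootstrapped estimate is available: the update sees one transition, not a full trajectory.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def online_actor_critic_step(theta, V, s, a, r, s_next, done,
                             alpha_actor=0.1, alpha_critic=0.1, gamma=1.0):
    """One online update from a single transition (s, a, r, s_next).

    theta: (n_states, n_actions) policy logits; V: (n_states,) value estimates.
    Both arrays are updated in place. Returns the advantage estimate.
    """
    # Policy evaluation: bootstrapped target r + gamma * V(s'), with V(s') = 0 at a terminal state.
    target = r + (0.0 if done else gamma * V[s_next])
    # Advantage estimate r + V(s') - V(s), computed before V(s) is changed.
    advantage = target - V[s]
    # Critic update: move V(s) toward the bootstrapped target.
    V[s] += alpha_critic * advantage
    # Actor update: for a softmax policy, grad of log pi(a|s) w.r.t. the logits
    # of state s is onehot(a) - pi(.|s).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * advantage * grad_log_pi
    return advantage

# Usage on a made-up 2-state, 2-action problem.
theta = np.zeros((2, 2))
V = np.zeros(2)
adv = online_actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
print(adv, V, theta[0])
```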

There are two network design options

[Figure: lecture slide]
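The slide itself is not reproduced here, so the following is only a guess at the two options as they are usually presented: two separate networks for actor and critic, versus one shared trunk with a policy head and a value head. The layer sizes, helper names, and the plain NumPy forward passes are all hypothetical and only illustrate the structural difference.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2  # hypothetical sizes

def init_layer(n_in, n_out):
    """Return (weights, bias) for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def dense(x, layer):
    W, b = layer
    return x @ W + b

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Design A: two separate networks, no shared parameters.
actor = [init_layer(STATE_DIM, HIDDEN), init_layer(HIDDEN, N_ACTIONS)]
critic = [init_layer(STATE_DIM, HIDDEN), init_layer(HIDDEN, 1)]

def separate_forward(s):
    pi = softmax(dense(np.tanh(dense(s, actor[0])), actor[1]))
    v = dense(np.tanh(dense(s, critic[0])), critic[1])[..., 0]
    return pi, v

# Design B: one shared trunk feeding a policy head and a value head.
trunk = init_layer(STATE_DIM, HIDDEN)
policy_head = init_layer(HIDDEN, N_ACTIONS)
value_head = init_layer(HIDDEN, 1)

def shared_forward(s):
    h = np.tanh(dense(s, trunk))  # features used by both heads
    return softmax(dense(h, policy_head)), dense(h, value_head)[..., 0]

states = rng.normal(size=(3, STATE_DIM))  # a batch of 3 made-up states
pi_a, v_a = separate_forward(states)
pi_b, v_b = shared_forward(states)
print(pi_a.shape, v_a.shape, pi_b.shape, v_b.shape)  # (3, 2) (3,) (3, 2) (3,)
```

Both designs expose the same interface (action probabilities plus a value per state); the difference is whether the actor and critic gradients flow into shared features.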

Synchronous and asynchronous parallel schemes

[Figure: lecture slide]

Critics as state-dependent baselines: this one I can still follow. It amounts to plugging the critic into policy gradient (PG) as the baseline.

[Figure: lecture slide]
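As a rough sketch of what "critic as baseline" means in code: the weight on each $\nabla \log \pi(a_t|s_t)$ term in the policy gradient is the Monte Carlo reward-to-go with the critic's $V(s_t)$ subtracted. The function name and numbers are hypothetical, and the sum is undiscounted to stay consistent with the formulas in this post.

```python
import numpy as np

def baseline_weights(rewards, values):
    """Per-step policy-gradient weights: reward-to-go minus the critic's V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Monte Carlo reward-to-go along one sampled trajectory.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    # The baseline depends only on the state, not on the action taken.
    return reward_to_go - values

weights = baseline_weights(rewards=[1.0, 0.0, 2.0], values=[2.5, 1.5, 2.0])
print(weights)  # [0.5 0.5 0. ]
```

Unlike the actor-critic advantage $r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$, the return here is still the full sampled sum; the critic is used only as the subtracted baseline.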

Control variates: action-dependent baselines. This one I could not follow.

[Figure: lecture slide]

The same goes for the slides after it.
