PR17.10.4: Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

What’s the problem?

  • A major obstacle to applying deep RL in the real world is its high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches.
  • TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize.
  • On-policy methods provide an almost unbiased but high-variance gradient, while off-policy methods provide a deterministic but biased gradient.

What are the challenges?

(1) The standard form of Monte Carlo policy gradient methods is shown below:

$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(R_t - b(s_t)\big)\right]$$

  • The gradient is estimated using Monte Carlo samples in practice and has very high variance. A proper choice of baseline b(s_t) is necessary to reduce the variance sufficiently so that learning becomes feasible (see the sketch after this list).
  • Another problem with the policy gradient is that it requires on-policy samples, which makes policy gradient optimization very sample-intensive.
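
To make this concrete, below is a minimal numerical sketch (not the paper's or rllab++'s code) of the Monte Carlo policy gradient with a state-dependent baseline. It assumes a toy linear-Gaussian policy a ~ N(Ws, σ²I); the shapes, the zero baseline, and the random returns are purely illustrative.

```python
import numpy as np

# Sketch of dJ/dW = E[ grad_W log pi(a|s) * (R_t - b(s_t)) ] for an assumed
# linear-Gaussian policy a ~ N(W s, sigma^2 I). All data below is synthetic.
def mc_policy_gradient(W, sigma, states, actions, returns, baseline):
    grad = np.zeros_like(W)
    for s, a, R in zip(states, actions, returns):
        mu = W @ s
        # grad_W log N(a; W s, sigma^2 I) = ((a - W s) / sigma^2) s^T
        grad_log_pi = np.outer((a - mu) / sigma**2, s)
        grad += grad_log_pi * (R - baseline(s))
    return grad / len(states)

# Toy on-policy "rollout" data: 3-dim states, 2-dim actions, random returns.
rng = np.random.default_rng(0)
W, sigma = rng.normal(size=(2, 3)), 0.5
states = [rng.normal(size=3) for _ in range(64)]
actions = [W @ s + sigma * rng.normal(size=2) for s in states]
returns = rng.normal(size=64)
g = mc_policy_gradient(W, sigma, states, actions, returns, baseline=lambda s: 0.0)
```

The estimator is (nearly) unbiased, but its variance scales with the magnitude of R_t − b(s_t), which is why the baseline choice and the batch size matter so much in practice.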

(2) Policy gradient methods with function approximation, or actor-critic methods, include a policy evaluation step, which often uses temporal-difference (TD) learning to fit a critic Q_w for the current policy π_θ, and a policy improvement step, which greedily optimizes the policy π against the critic estimate Q_w.

The gradient in the policy improvement phase is given below:

$$\frac{\partial J(\theta)}{\partial\theta} \approx \mathbb{E}_{s_t\sim\rho_{\beta}(\cdot)}\left[\nabla_{a}Q_w(s_t,a)\big|_{a=\mu_{\theta}(s_t)}\,\nabla_{\theta}\mu_{\theta}(s_t)\right]$$

These properties make DDPG and other analogous off-policy methods significantly more sample-efficient than policy gradient methods. However, the biased policy gradient estimator makes analyzing its convergence and stability properties difficult.
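
For contrast, here is an equally small sketch of the deterministic, DDPG-style policy improvement step. To keep the gradients in closed form it assumes a linear policy μ_θ(s) = Θs and a critic that is linear in the action, Q_w(s, a) = w_s·s + w_a·a; both forms are illustrative assumptions, not the DDPG architecture.

```python
import numpy as np

# Sketch of grad_Theta J ≈ E_s[ grad_a Q_w(s,a)|_{a=mu(s)} * grad_Theta mu(s) ]
# for an assumed linear policy mu(s) = Theta @ s and a critic linear in a,
# so that grad_a Q_w(s, a) is simply the constant vector w_a.
def deterministic_actor_gradient(Theta, w_a, states):
    grad = np.zeros_like(Theta)
    for s in states:
        grad_a_Q = w_a                   # dQ_w(s, a)/da for the linear-in-a critic
        grad += np.outer(grad_a_Q, s)    # chain rule through mu(s) = Theta @ s
    return grad / len(states)

rng = np.random.default_rng(1)
Theta, w_a = rng.normal(size=(2, 3)), rng.normal(size=2)
states = [rng.normal(size=3) for _ in range(64)]
g_det = deterministic_actor_gradient(Theta, w_a, states)
```

Because no actions are sampled, the estimate has very low variance; its accuracy, however, depends entirely on how well Q_w approximates the true action-value function, which is where the bias comes from.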

What’s the SOTA?

In this paper, they propose Q-Prop, a step in this direction that combines the advantages of on-policy policy gradient methods with the efficiency of off-policy learning. Q-Prop can reduce the variance of the gradient estimator without adding bias.

What’s the proposed solution?

(1) To derive the Q-Prop gradient estimator, we start from the first-order Taylor expansion of an arbitrary function f(s_t, a_t) around a point ā_t, i.e. f̄(s_t, a_t) = f(s_t, ā_t) + ∇_a f(s_t, a)|_{a=ā_t} (a_t − ā_t). Adding and subtracting f̄ inside the policy gradient gives:

$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(\hat{Q}(s_t,a_t)-\bar{f}(s_t,a_t)\big)\right] + \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\bar{f}(s_t,a_t)\right]$$

$$= \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(\hat{Q}(s_t,a_t)-\bar{f}(s_t,a_t)\big)\right] + \mathbb{E}_{\rho_{\pi}}\left[\nabla_{a}f(s_t,a)\big|_{a=\bar{a}_t}\,\nabla_{\theta}\mu_{\theta}(s_t)\right]$$

A sensible choice is to use the critic Q_w for f and μ_θ(s_t) for ā_t, which gives:

$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(\hat{Q}(s_t,a_t)-\bar{Q}_w(s_t,a_t)\big)\right] + \mathbb{E}_{\rho_{\pi}}\left[\nabla_{a}Q_w(s_t,a)\big|_{a=\mu_{\theta}(s_t)}\,\nabla_{\theta}\mu_{\theta}(s_t)\right]$$

In practice, we estimate advantages Â(s_t, a_t), so the Q-Prop estimator is rewritten in terms of advantages to complete the basic derivation:

$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(\hat{A}(s_t,a_t)-\bar{A}_w(s_t,a_t)\big)\right] + \mathbb{E}_{\rho_{\pi}}\left[\nabla_{a}Q_w(s_t,a)\big|_{a=\mu_{\theta}(s_t)}\,\nabla_{\theta}\mu_{\theta}(s_t)\right]$$

where the critic-based advantage is the first-order term $\bar{A}_w(s_t,a_t)=\nabla_{a}Q_w(s_t,a)\big|_{a=\mu_{\theta}(s_t)}\,(a_t-\mu_{\theta}(s_t))$.
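
Putting the two pieces together, here is a hedged sketch of the advantage-form Q-Prop estimator, reusing the same toy assumptions as the sketches above (Gaussian policy with mean Ws, critic with constant action-gradient w_a). The Monte Carlo term now acts on the residual Â − Ā_w, and the second term is the deterministic gradient through the critic.

```python
import numpy as np

# Hedged sketch of the Q-Prop estimator in advantage form, under toy assumptions:
# policy a ~ N(W s, sigma^2 I), critic with grad_a Q_w(s, a)|_{a=mu} = w_a.
# adv_hat stands in for Monte Carlo advantage estimates (e.g. from GAE).
def qprop_gradient(W, sigma, w_a, states, actions, adv_hat):
    n = len(states)
    mc_term = np.zeros_like(W)
    analytic_term = np.zeros_like(W)
    for s, a, A_hat in zip(states, actions, adv_hat):
        mu = W @ s
        A_bar = w_a @ (a - mu)                        # A_bar_w = grad_a Q_w|_mu . (a - mu)
        grad_log_pi = np.outer((a - mu) / sigma**2, s)
        mc_term += grad_log_pi * (A_hat - A_bar)      # high-variance term, now on a small residual
        analytic_term += np.outer(w_a, s)             # deterministic control-variate correction
    return mc_term / n + analytic_term / n

rng = np.random.default_rng(2)
W, sigma, w_a = rng.normal(size=(2, 3)), 0.5, rng.normal(size=2)
states = [rng.normal(size=3) for _ in range(64)]
actions = [W @ s + sigma * rng.normal(size=2) for s in states]
adv_hat = rng.normal(size=64)                         # stand-in for estimated advantages
g_qprop = qprop_gradient(W, sigma, w_a, states, actions, adv_hat)
```

The intuition: when the critic is accurate, Â − Ā_w is small and the noisy Monte Carlo term nearly vanishes, leaving the low-variance analytic term; when the critic is poor, the Monte Carlo term corrects it, so the overall estimate remains unbiased.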
(2) Control variate analysis and adaptive Q-Prop

A weighting variable η(s_t) modulates the strength of the control variate; this additional variable does not introduce bias into the estimator.
$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}_{\rho_{\pi},\pi}\left[\nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)\,\big(\hat{A}(s_t,a_t)-\eta(s_t)\bar{A}_w(s_t,a_t)\big)\right] + \mathbb{E}_{\rho_{\pi}}\left[\eta(s_t)\,\nabla_{a}Q_w(s_t,a)\big|_{a=\mu_{\theta}(s_t)}\,\nabla_{\theta}\mu_{\theta}(s_t)\right]$$

[Equations (10) and (11) in the paper then decompose the variance of the Monte Carlo term in terms of Var(Â), η(s_t)²·Var(Ā_w), and a 2η(s_t)·Cov(Â, Ā_w) cross term.]
Note: I did not fully follow the derivation of Eq. (10), but Eq. (11) shows that the size of the variance can be controlled through the 2η(s_t) factor in front of the covariance term.

Additionally, the paper introduces adaptive Q-Prop, which chooses η(s_t) to reduce the estimator variance, along with two practical variants, conservative and aggressive Q-Prop (see the paper for details; a rough sketch of the η rules is given below).
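
As a rough illustration of these η(s_t) rules, the sketch below uses crude single-batch moment estimates in place of the per-state covariance and variance; this simplification, and the helper function itself, are assumptions for illustration, not the rllab++ implementation.

```python
import numpy as np

# adv_hat: Monte Carlo advantage estimates; adv_bar: critic-based (Taylor) advantages.
def eta_weight(adv_hat, adv_bar, mode="conservative"):
    cov = float(np.mean(adv_hat * adv_bar))    # crude batch estimate of Cov(A_hat, A_bar)
    var = float(np.mean(adv_bar ** 2)) + 1e-8  # crude batch estimate of Var(A_bar)
    if mode == "adaptive":
        return cov / var                       # weight that minimizes the estimator variance
    if mode == "conservative":
        return 1.0 if cov > 0 else 0.0         # use the control variate only when it helps
    if mode == "aggressive":
        return float(np.sign(cov))             # always use it, flipping sign if needed
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(3)
eta = eta_weight(rng.normal(size=64), rng.normal(size=64), mode="conservative")
```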

Pseudo-code:
[See Algorithm 1 in the paper: adaptive Q-Prop.]
Source code:
https://github.com/shaneshixiang/rllabplusplus

What’s the performance of the proposed solution?

Figure 2b shows the performance of conservative Q-Prop against TRPO across different batch sizes. Due to high variance in gradient estimates, TRPO typically requires very large batch sizes.
Figure 3a shows that the c-Q-Prop methods significantly outperform the best TRPO and VPG methods. DDPG, on the other hand, exhibits inconsistent performance. With proper reward scaling, i.e. “DDPG-r0.1”, it outperforms other methods as well as the DDPG results reported in prior work (Duan et al., 2016; Amos et al., 2016). This illustrates the sensitivity of DDPG to hyperparameter settings, whereas Q-Prop exhibits more stable, monotonic learning behavior.
They evaluate Q-Prop against TRPO and DDPG across multiple domains.

Conclusion

They presented Q-Prop, a policy gradient algorithm that combines reliable, consistent, and potentially unbiased on-policy gradient estimation with a sample-efficient off-policy critic that acts as a control variate. The method provides a large improvement in sample efficiency compared to state-of-the-art policy gradient methods such as TRPO, while outperforming state-of-the-art actor-critic methods on more challenging tasks such as humanoid locomotion.
