Policy Gradient Algorithms

 2019-10-02 17:37:47

This blog is from: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html 

 

Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, SAC, TD3 & SVPG.

 

What is Policy Gradient

Policy gradient is an approach to solving reinforcement learning problems. If you haven’t looked into the field of reinforcement learning yet, please first read the section “A (Long) Peek into Reinforcement Learning » Key Concepts” for the problem definition and key concepts.

Notations

Here is a list of notations to help you read through equations in the post easily.

Symbol | Meaning
$s \in \mathcal{S}$ | States.
$a \in \mathcal{A}$ | Actions.
$r \in \mathcal{R}$ | Rewards.
$S_t, A_t, R_t$ | State, action, and reward at time step $t$ of one trajectory. I may occasionally use $s_t, a_t, r_t$ as well.
$\gamma$ | Discount factor; penalty to uncertainty of future rewards; $0 < \gamma \leq 1$.
$G_t$ | Return, or discounted future reward; $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
$P(s', r \vert s, a)$ | Transition probability of getting to the next state $s'$ from the current state $s$ with action $a$ and reward $r$.
$\pi(a \vert s)$ | Stochastic policy (agent behavior strategy); $\pi_\theta(.)$ is a policy parameterized by $\theta$.
$\mu(s)$ | Deterministic policy; we can also label this as $\pi(s)$, but using a different letter gives better distinction so that we can easily tell when the policy is stochastic or deterministic without further explanation. Either $\pi$ or $\mu$ is what a reinforcement learning algorithm aims to learn.
$V(s)$ | State-value function, which measures the expected return of state $s$; $V_w(.)$ is a value function parameterized by $w$.
$V^\pi(s)$ | The value of state $s$ when we follow a policy $\pi$; $V^\pi(s) = \mathbb{E}_{a \sim \pi}[G_t \vert S_t = s]$.
$Q(s, a)$ | Action-value function, similar to $V(s)$, but it assesses the expected return of a pair of state and action $(s, a)$; $Q_w(.)$ is an action-value function parameterized by $w$.
$Q^\pi(s, a)$ | Similar to $V^\pi(.)$, the value of a (state, action) pair when we follow a policy $\pi$; $Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[G_t \vert S_t = s, A_t = a]$.
$A(s, a)$ | Advantage function, $A(s, a) = Q(s, a) - V(s)$; it can be considered as another version of the Q-value with lower variance, obtained by taking the state value off as the baseline.

Policy Gradient

The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. Policy gradient methods aim to model and optimize the policy directly. The policy is usually modeled with a parameterized function with respect to $\theta$, $\pi_\theta(a \vert s)$. The value of the reward (objective) function depends on this policy, and then various algorithms can be applied to optimize $\theta$ for the best reward.
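
As a concrete illustration (not prescribed by the original post; $\phi(s, a)$ is an assumed feature vector for a state-action pair), a common parameterization for a discrete action space is a softmax over linear action preferences:

$$\pi_\theta(a \vert s) = \frac{\exp\big(\theta^\top \phi(s, a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\theta^\top \phi(s, a')\big)}$$

For continuous actions, a Gaussian policy whose mean (and possibly variance) is a function of $s$ parameterized by $\theta$ plays the same role.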

The reward function is defined as:

$$J(\theta) = \sum_{s \in \mathcal{S}} d^\pi(s) V^\pi(s) = \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a \vert s) Q^\pi(s, a)$$

where $d^\pi(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$ (the on-policy state distribution under $\pi$). For simplicity, the parameter $\theta$ is omitted for the policy $\pi_\theta$ when the policy appears in the subscript of other functions; for example, $d^\pi$ and $Q^\pi$ should be $d^{\pi_\theta}$ and $Q^{\pi_\theta}$ if written in full. Imagine that you can travel along the Markov chain’s states forever; eventually, as time progresses, the probability of ending up in any particular state becomes unchanged. This is the stationary probability for $\pi_\theta$: $d^\pi(s) = \lim_{t \to \infty} P(s_t = s \vert s_0, \pi_\theta)$ is the probability that $s_t = s$ when starting from $s_0$ and following policy $\pi_\theta$ for $t$ steps. In fact, the existence of the stationary distribution of a Markov chain is one main reason why the PageRank algorithm works. If you want to read more, check this.
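
To make the stationary distribution concrete, here is a minimal numerical sketch (not from the original post). The 3-state transition matrix `P_pi` below, i.e. the chain induced by some fixed policy, is entirely made up for illustration; repeatedly applying it to any initial state distribution converges to $d^\pi$.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by a fixed policy pi:
# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a). The numbers are made up.
P_pi = np.array([[0.5, 0.4, 0.1],
                 [0.2, 0.6, 0.2],
                 [0.1, 0.3, 0.6]])

d = np.ones(3) / 3        # start from an arbitrary distribution over states
for _ in range(1000):     # d <- d P_pi; converges to the stationary d^pi
    d = d @ P_pi

print(d)                  # stationary distribution
print(d @ P_pi)           # unchanged by one more transition step
```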

It is natural to expect policy-based methods to be more useful in continuous spaces, because there is an infinite number of actions and/or states to estimate values for, which makes value-based approaches computationally far too expensive. For example, in generalized policy iteration, the policy improvement step $\arg\max_{a \in \mathcal{A}} Q^\pi(s, a)$ requires a full scan of the action space, suffering from the curse of dimensionality.

Using gradient ascent, we can move $\theta$ in the direction suggested by the gradient $\nabla_\theta J(\theta)$ to find the best $\theta$ for $\pi_\theta$ that produces the highest return.
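
Written out as an update rule (with $\alpha$ denoting a step size, a symbol not used elsewhere in this post):

$$\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$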

Policy Gradient Theorem

Computing the gradient $\nabla_\theta J(\theta)$ is tricky because it depends on both the action selection (directly determined by $\pi_\theta$) and the stationary distribution of states following the target selection behavior (indirectly determined by $\pi_\theta$). Given that the environment is generally unknown, it is difficult to estimate the effect of a policy update on the state distribution.

Luckily, the policy gradient theorem comes to save the world! Woohoo! It provides a nice reformulation of the derivative of the objective function that does not involve the derivative of the state distribution $d^\pi(.)$ and simplifies the computation of the gradient $\nabla_\theta J(\theta)$ a lot.

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \pi_\theta(a \vert s) \propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} Q^\pi(s, a) \nabla_\theta \pi_\theta(a \vert s)$$
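
As a rough illustration of how this result is used in practice (not part of the original post), here is a minimal REINFORCE-style sketch in NumPy. It relies on the identity $\nabla_\theta \pi_\theta(a \vert s) = \pi_\theta(a \vert s) \nabla_\theta \ln \pi_\theta(a \vert s)$, which turns the sum above into an expectation over on-policy samples, $\mathbb{E}_\pi[G_t \nabla_\theta \ln \pi_\theta(a_t \vert s_t)]$, estimated from sampled trajectories. The toy 2-state MDP, the tabular softmax parameterization, and all hyperparameters below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

# Toy dynamics (made up): P[s, a] = next-state probabilities, R[s, a] = reward.
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

theta = np.zeros((n_states, n_actions))   # policy parameters (softmax logits)

def pi(s):
    """Softmax policy pi_theta(a|s) for a tabular parameterization."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Analytic gradient of log pi_theta(a|s) w.r.t. theta (softmax policy)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

alpha, horizon = 0.05, 20
for episode in range(2000):
    # Roll out one on-policy trajectory.
    s, traj = 0, []
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi(s))
        s_next = rng.choice(n_states, p=P[s, a])
        traj.append((s, a, R[s, a]))
        s = s_next

    # Monte Carlo returns G_t for every step of the trajectory.
    returns, G = [], 0.0
    for (_, _, r_t) in reversed(traj):
        G = r_t + gamma * G
        returns.append(G)
    returns.reverse()

    # Gradient ascent: theta <- theta + alpha * gamma^t * G_t * grad log pi(a_t|s_t)
    for t, ((s_t, a_t, _), G_t) in enumerate(zip(traj, returns)):
        theta += alpha * (gamma ** t) * G_t * grad_log_pi(s_t, a_t)

print("learned policy:", np.round([pi(s) for s in range(n_states)], 2))
```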

Proof of Policy Gradient Theorem

This section is pretty dense, as it is time for us to go through the proof (Sutton & Barto, 2017; Sec. 13.1) and figure out why the policy gradient theorem is correct.

We first start with the derivative of the state value function:

$$
\begin{aligned}
\nabla_\theta V^\pi(s)
&= \nabla_\theta \Big( \sum_{a \in \mathcal{A}} \pi_\theta(a \vert s) Q^\pi(s, a) \Big) \\
&= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \nabla_\theta Q^\pi(s, a) \Big) && \text{derivative product rule} \\
&= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \nabla_\theta \sum_{s', r} P(s', r \vert s, a) \big(r + V^\pi(s')\big) \Big) && \text{extend } Q^\pi \text{ with future state value} \\
&= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \sum_{s', r} P(s', r \vert s, a) \nabla_\theta V^\pi(s') \Big) && P(s', r \vert s, a) \text{ and } r \text{ are not functions of } \theta \\
&= \sum_{a \in \mathcal{A}} \Big( \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s, a) + \pi_\theta(a \vert s) \sum_{s'} P(s' \vert s, a) \nabla_\theta V^\pi(s') \Big) && \text{because } P(s' \vert s, a) = \sum_r P(s', r \vert s, a)
\end{aligned}
$$

Reposted from: https://www.cnblogs.com/wangxiaocvpr/p/11617854.html
