Actor-Critic Algorithms: A Brief Note

Actor-Critic

Actor-critic methods are a class of algorithms that precede Q-Learning and SARSA. Refer to

V. Konda and J. Tsitsiklis: Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143-1166 (2003).

for more details. We first give some basics:

Policy Gradient Methods & Policy Gradient Theorem

We consider methods that learn the policy weights based on the gradient of some performance measure $\eta(\theta)$ with respect to those weights. These methods seek to maximize performance, so their updates approximate gradient ascent in $\eta$:

$$\theta_{t+1} = \theta_t + \alpha \, \widehat{\nabla \eta(\theta_t)}$$

All methods that follow this general schema are known as policy gradient methods, whether or not they also learn an approximate value function. A value function may still be used to learn the policy weights $\theta \in \mathbb{R}^n$, but it is not required for action selection.

We consider the episodic case only, in which performance is defined as the value of the start state under the parameterized policy, $\eta(\theta) = v_{\pi_\theta}(s_0)$. (In the continuing case, performance is defined as the average reward rate.) In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a|s,\theta)$ is differentiable with respect to its weights, i.e. $\nabla_\theta \pi(a|s,\theta)$ always exists and is finite. To ensure exploration, we generally require that the policy never become deterministic.
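For discrete action sets, one common differentiable parameterization (not specific to this note, included here as an illustrative assumption) is a softmax over linear action preferences, for which $\nabla_\theta \log \pi(a|s,\theta)$ has a simple closed form. A minimal NumPy sketch, assuming a reader-supplied feature function `phi(state, action)` returning a vector of the same dimension as `theta`:

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi(a|s, theta) proportional to exp(theta^T phi(s, a))."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, phi, state, action, actions):
    """grad_theta log pi(a|s) = phi(s, a) - sum_b pi(b|s) phi(s, b)."""
    probs = softmax_policy(theta, phi, state, actions)
    expected_phi = sum(p * phi(state, b) for p, b in zip(probs, actions))
    return phi(state, action) - expected_phi
```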

Without losing any meaningful generality, we assume that every episode starts in some particular state $s_0$. Then we define performance as:

$$\eta(\theta) = v_{\pi_\theta}(s_0)$$

The policy gradient theorem is that

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta)$$

where the gradients in all cases are the column vectors of partial derivatives with respect to the components of $\theta$, $\pi$ denotes the policy corresponding to the weight vector $\theta$, and the distribution $d_\pi(s)$ is the expected number of time steps $t$ on which $S_t = s$ in a randomly generated episode starting in $s_0$ and following $\pi$ and the dynamics of the MDP.

REINFORCE

Now we have an exact expression for the gradient. We need some way of sampling whose expectation equals or approximates this expression. Notice that the right-hand side is a sum over states weighted by how often the states occur under the target policy $\pi$, weighted again by $\gamma$ raised to the number of steps it takes to reach those states. If we just follow $\pi$ we will encounter states in these proportions, which we can then weight by $\gamma^t$ to preserve the expected value:

$$\nabla \eta(\theta) = \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \nabla_\theta \pi(a|S_t,\theta)\right]$$

Then we handle the sum over actions by multiplying and dividing each term by $\pi(a|S_t,\theta)$, so that it becomes an expectation over the action $A_t$ actually taken; finally $q_\pi(S_t,A_t)$ is replaced by the return $G_t$, since $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t,A_t)$:

$$\begin{aligned}
\nabla \eta(\theta) &= \mathbb{E}_\pi\!\left[\gamma^t \sum_a q_\pi(S_t,a)\, \pi(a|S_t,\theta)\, \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t q_\pi(S_t,A_t)\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right] \\
&= \mathbb{E}_\pi\!\left[\gamma^t G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]
\end{aligned}$$

Using the sample to instantiate the generic stochastic gradient ascent algorithm, we obtain the update:
$$\theta_{t+1} = \theta_t + \alpha \gamma^t G_t\, \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}$$

It is evident that REINFORCE is a Monte Carlo policy gradient method. The update increases the weight vector in the direction $\nabla_\theta \pi(A_t|S_t,\theta) / \pi(A_t|S_t,\theta)$, in proportion to the return and inversely proportional to the action probability. The former makes sense because it causes the weights to move most in the directions that favor actions yielding the highest return. The latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield the highest return.

Using the identity $\nabla_\theta \log \pi = \nabla_\theta \pi / \pi$, the update law can also be written as:

$$\theta_{t+1} = \theta_t + \alpha \gamma^t G_t\, \nabla_\theta \log \pi(A_t|S_t,\theta)$$
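To make the update concrete, here is a minimal sketch of REINFORCE for one episode. It assumes the hypothetical helpers `sample_action(theta, s)` and `grad_log_pi(theta, s, a)` (for example, the softmax versions sketched earlier) and an environment object whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done)`; treat it as an illustration of the update above, not a reference implementation:

```python
def reinforce_episode(env, theta, sample_action, grad_log_pi,
                      alpha=1e-3, gamma=0.99):
    """Run one episode, then apply the Monte Carlo policy gradient update."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = sample_action(theta, s)          # A_t ~ pi(.|S_t, theta)
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next

    # Compute returns G_t = R_{t+1} + gamma * G_{t+1} backwards from the end.
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G

    # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t|S_t, theta)
    for t, (s, a) in enumerate(zip(states, actions)):
        theta = theta + alpha * (gamma ** t) * returns[t] * grad_log_pi(theta, s, a)
    return theta
```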

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla \eta(\theta) = \sum_s d_\pi(s) \sum_a \left(q_\pi(s,a) - b(s)\right) \nabla_\theta \pi(a|s,\theta)$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$. Then we rewrite the update law as:

$$\theta_{t+1} = \theta_t + \alpha \left(G_t - b(S_t)\right) \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. One natural choice is an estimate of the state value, $\hat{v}(S_t, \mathbf{w})$.
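A minimal sketch of the REINFORCE-with-baseline update, reusing the hypothetical `grad_log_pi` from above and assuming a linear value estimate $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top \phi(s)$ with a reader-supplied feature function `phi(s)`; `states`, `actions`, and `returns` are the per-step quantities gathered for one episode as in the earlier sketch:

```python
def reinforce_with_baseline_update(theta, w, phi, grad_log_pi,
                                   states, actions, returns,
                                   alpha=1e-3, beta=1e-2):
    """Per-episode updates: w tracks v_hat as the baseline, theta follows the baselined gradient."""
    for s, a, G in zip(states, actions, returns):
        v_hat = w @ phi(s)                    # baseline b(S_t) = v_hat(S_t, w)
        delta = G - v_hat                     # G_t - b(S_t)
        w = w + beta * delta * phi(s)         # gradient step on (G_t - v_hat)^2 / 2
        theta = theta + alpha * delta * grad_log_pi(theta, s, a)
    return theta, w
```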

Actor-Critic

Methods that learn approximations to both policy and value functions are called actor-critic methods. REINFORCE-with-baseline methods use the value function only as a baseline, not as a critic, i.e. not for bootstrapping. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation.

Consider one-step actor-critic methods. The main appeal of one-step methods is that they are fully online and incremental, like TD(0), SARSA(0), and Q-Learning. One-step actor-critic methods replace the full return of REINFORCE with the one-step return:

$$\theta_{t+1} = \theta_t + \alpha \left(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\right) \nabla_\theta \log \pi(A_t|S_t,\theta)$$

The full algorithm can be formulated as follows; a code sketch is given after the list:

  1. Initialize a differentiable policy parameterization $\pi(a|s,\theta)$.
  2. Initialize a differentiable state-value parameterization $\hat{v}(s,\mathbf{w})$.
  3. Set step sizes $\alpha > 0$, $\beta > 0$.
  4. Initialize policy weights $\theta$ and state-value weights $\mathbf{w}$.
  5. Repeat:
    1. Initialize $S$ (first state of the episode)
    2. $I = 1$
    3. While $S$ is not terminal:
      1. Sample $A \sim \pi(\cdot|S,\theta)$
      2. Take action $A$, observe $S'$, $R$
      3. $\delta = R + \gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$ (with $\hat{v}(S',\mathbf{w}) = 0$ if $S'$ is terminal)
      4. $\mathbf{w} = \mathbf{w} + \beta \delta \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})$
      5. $\theta = \theta + \alpha I \delta \nabla_\theta \log \pi(A|S,\theta)$
      6. $I = \gamma I$
      7. $S = S'$
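A minimal Python sketch of the loop above, assuming a linear critic $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top \phi(s)$, the hypothetical helpers `sample_action` and `grad_log_pi` from earlier, a feature function `phi(s)`, and an environment whose `step(a)` returns `(next_state, reward, done)`; treat it as an illustration of the pseudocode rather than a tuned implementation:

```python
def one_step_actor_critic_episode(env, theta, w, phi, sample_action, grad_log_pi,
                                  alpha=1e-3, beta=1e-2, gamma=0.99):
    """One episode of one-step actor-critic with a linear critic v_hat(s) = w^T phi(s)."""
    s, done = env.reset(), False
    I = 1.0
    while not done:
        a = sample_action(theta, s)                    # A ~ pi(.|S, theta)
        s_next, r, done = env.step(a)
        v_s = w @ phi(s)
        v_s_next = 0.0 if done else w @ phi(s_next)    # v_hat of a terminal state is 0
        delta = r + gamma * v_s_next - v_s             # one-step TD error
        w = w + beta * delta * phi(s)                  # critic update
        theta = theta + alpha * I * delta * grad_log_pi(theta, s, a)  # actor update
        I = gamma * I
        s = s_next
    return theta, w
```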

Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous spaces. All of the notes above deal with discrete actions; from here on we consider continuous ones.

Instead of computing learned probabilities for each of many actions, we learn the statistics of a probability distribution. Take the Gaussian distribution for example. The action set might be the real numbers, with actions chosen from a Gaussian distribution:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The value $p(x)$ is the density of the probability at $x$.

To produce a policy parameterization, we can define the policy as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators:

$$\pi(a|s,\theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right)$$

Then we divide the policy weight vector into two parts: $\theta = [\theta_\mu, \theta_\sigma]^T$. The mean can be approximated as a linear function:

$$\mu(s,\theta) = \theta_\mu^T \phi(s)$$

The standard deviation must always be positive and is better approximated as the exponential of a linear function:

$$\sigma(s,\theta) = \exp\!\left(\theta_\sigma^T \phi(s)\right)$$

where $\phi(s)$ is a basis function of some type. With these definitions, all the algorithms described above can be applied to learn to select real-valued actions.
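A minimal sketch of this Gaussian parameterization, showing how to sample an action and compute $\nabla_\theta \log \pi(a|s,\theta)$ for the two weight blocks; the closed forms follow from differentiating the log-density, and `phi(s)` is again a reader-supplied feature function:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, phi, s, rng=None):
    """Sample a ~ N(mu(s,theta), sigma(s,theta)^2), with linear mu and exp-of-linear sigma."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = theta_mu @ phi(s)
    sigma = np.exp(theta_sigma @ phi(s))
    return rng.normal(mu, sigma)

def gaussian_grad_log_pi(theta_mu, theta_sigma, phi, s, a):
    """Closed-form gradients of log pi(a|s,theta) w.r.t. theta_mu and theta_sigma."""
    mu = theta_mu @ phi(s)
    sigma = np.exp(theta_sigma @ phi(s))
    grad_mu = ((a - mu) / sigma**2) * phi(s)              # from mu(s) = theta_mu^T phi(s)
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * phi(s)  # from sigma(s) = exp(theta_sigma^T phi(s))
    return grad_mu, grad_sigma
```

These gradients can be plugged into REINFORCE or the one-step actor-critic sketches above in place of the discrete-action `grad_log_pi`.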

Summary

Actor-Critic, which is a branch of TD methods, keeps a separate policy independent of the value function. The policy is called the actor and the value function is called the critic. An advantage of having a separate policy representation is that if there are many actions, or when the action space is continuous, there is no need to consider all actions' Q-values in order to select one of them. A second advantage is that such methods can learn stochastic policies naturally. Furthermore, prior knowledge about policy constraints can be incorporated.
