A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients
Abstract:
Policy-gradient-based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework.
Their advantage of being able to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control, and finance.
Although general surveys on reinforcement learning techniques already exist, none is specifically dedicated to actor-critic algorithms.
This paper, therefore, describes the state of the art of actor-critic algorithms, with a focus on methods that can work in an online setting and use function approximation in order to deal with continuous state and action spaces.
After starting with a discussion on the concepts of reinforcement learning and the origins of actor-critic algorithms, this paper describes the workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years.
A review of several standard and natural actor-critic algorithms is given, and the paper concludes with an overview of application areas and a discussion on open issues.
Published in: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) (Volume: 42, Issue: 6, November 2012)
Page(s): 1291 - 1307
Date of Publication: November 2012
DOI: 10.1109/TSMCC.2012.2218595
Publisher: IEEE
SECTION I. Introduction
Reinforcement Learning (RL) is a framework in which an agent (or controller) optimizes its behavior by interacting with its environment.
After taking an action in some state, the agent receives a scalar reward from the environment, which gives the agent an indication of the quality of that action. The function that indicates the action to take in a certain state is called the policy. The main goal of the agent is to find a policy that maximizes the total accumulated reward, also called the return. By following a given policy and processing the rewards, the agent can build estimates of the return. The function representing this estimated return is known as the value function. Using this value function allows the agent to make indirect use of past experiences to decide on future actions to take in or around a certain state.
Over the course of time, several types of RL algorithms have been introduced, and they can be divided into three groups [1]: actor-only, critic-only, and actor-critic methods, where the words actor and critic are synonyms for the policy and value function, respectively. Actor-only methods typically work with a parameterized family of policies over which optimization procedures can be used directly.
The benefit of a parameterized policy is that a spectrum of continuous actions can be generated, but the optimization methods used (typically policy gradient methods) suffer from high variance in the estimates of the gradient, leading to slow learning [1]–[5].
Critic-only methods that use temporal difference (TD) learning have a lower variance in the estimates of expected returns [3], [5], [6]. A straightforward way of deriving a policy in critic-only methods is by selecting greedy actions [7]: actions for which the value function indicates that the expected return is the highest. However, to do this, one needs to resort to an optimization procedure in every state encountered to find the action leading to an optimal value.
This can be computationally intensive, especially if the action space is continuous. Therefore, critic-only methods usually discretize the continuous action space, after which the optimization over the action space becomes a matter of enumeration.
Obviously, this approach undermines the ability of using continuous actions and thus of finding the true optimum.
Actor-critic methods combine the advantages of actor-only and critic-only methods.
While the parameterized actor brings the advantage of computing continuous actions without the need for optimization procedures on a value function, the critic’s merit is that it supplies the actor with low-variance knowledge of the performance.
More specifically, the critic’s estimate of the expected return allows for the actor to update with gradients that have lower variance, speeding up the learning process.
The lower variance is traded for a larger bias at the start of learning when the critic’s estimates are far from accurate [5]. Actor-critic methods usually have good convergence properties, in contrast with critic-only methods [1].
These nice properties of actor-critic methods have made them a preferred RL algorithm, also in real-life application domains. General surveys on RL already exist [8]–[10], but because of the growing popularity and recent developments in the field of actor-critic algorithms, this class of reinforcement algorithms deserves a survey in its own right.
The goal of this paper is to give an overview of the work on (online) actor-critic algorithms, giving technical details of some representative algorithms, as well as to provide references to a number of application papers.
Additionally, the algorithms are presented in one unified notation, which allows for a better technical comparison of the variants and implementations.
Because the discrete-time variant has been developed to a reasonable level of maturity, this paper solely discusses algorithms in the discrete-time setting. Continuous-time variants of actor-critic algorithms (see, e.g., [11] and [12]) and multiagent actor-critic schemes [13], [14] are not considered here.
The focus is put on actor-critic algorithms based on policy gradients, which constitute the largest part of actor-critic algorithms. A distinction is made between algorithms that use a standard (sometimes also called vanilla) gradient and the natural gradient that became more popular in the course of the last decade. The remaining part of actor-critic algorithms consists mainly of algorithms that update a policy by moving it toward the greedy policy underlying an approximate state-action value function [10]. In this paper, these algorithms are regarded as critic-only algorithms as the policy is implemented implicitly by the critic. Algorithms are only categorized as actor-critic here if they implement two separately parameterized representations for the actor and the critic.
Furthermore, all algorithms make use of function approximation, which in real-life applications such as robotics is necessary in order to deal with continuous state and action spaces.
This paper is organized as follows. Section II introduces the basic concepts of a Markov decision process (MDP), which is the cornerstone of RL. Section III describes critic-only, actor-only, and actor-critic RL algorithms and the important policy gradient theorem, after which Section IV surveys actor-critic algorithms that use a standard gradient. Section V describes the natural gradient and its application to actor-critic methods, as well as surveys several natural actor-critic algorithms. Section VI briefly reviews the application areas of these methods. A discussion and future outlook is provided in Section VII.
SECTION II. Markov Decision Processes
This section introduces the concepts of discrete-time RL, based on [7], but extended to the use of continuous state and action spaces and also assuming a stochastic setting, as covered more extensively in [15] and [16].
An RL algorithm can be used to solve problems modeled as MDPs. An MDP is a tuple $\langle X,U,f,\rho\rangle$, where $X$ denotes the state space, $U$ the action space, $f:X\times U\times X\mapsto[0,\infty)$ the state transition probability density function, and $\rho:X\times U\times X\mapsto\mathbb{R}$ the reward function. In this paper, only stationary MDPs are considered, i.e., the elements of the tuple $\langle X,U,f,\rho\rangle$ do not change over time.
The stochastic process to be controlled is described by the state transition probability density function $f$. It is important to note that since the state space is continuous, it is only possible to define a probability of reaching a certain state region, as the probability of reaching any particular state is zero. The probability of reaching a state $x_{k+1}$ in the region $X_{k+1}\subseteq X$ from state $x_k$ after applying action $u_k$ is
$$P(x_{k+1}\in X_{k+1}\,|\,x_k,u_k)=\int_{X_{k+1}} f(x_k,u_k,x')\,dx'.$$
After each transition to a state $x_{k+1}$, the controller receives an immediate reward
$$r_{k+1}=\rho(x_k,u_k,x_{k+1})$$
which depends on the previous state, the current state, and the action taken. The reward function $\rho$ is assumed to be bounded. The action $u_k$ taken in a state $x_k$ is drawn from a stochastic policy $\pi:X\times U\mapsto[0,\infty)$.
The goal of the RL agent is to find the policy $\pi$ which maximizes the expected value of a certain function $g$ of the immediate rewards received while following the policy $\pi$. This expected value is the cost-to-go function
$$J(\pi)=E\{g(r_1,r_2,\ldots)\,|\,\pi\}.$$
In most cases, the function $g$ is either the discounted sum of rewards or the average reward received, as explained next.
A. Discounted Reward
In the discounted reward setting [18], the cost function $J$ is equal to the expected value of the discounted sum of rewards when starting from an initial state $x_0\in X$ drawn from an initial state distribution $x_0\sim d_0(\cdot)$; this sum is also called the discounted return
$$J(\pi)=E\left\{\sum_{k=0}^{\infty}\gamma^k r_{k+1}\,\Big|\,d_0,\pi\right\}=\int_X d_\gamma^\pi(x)\int_U \pi(x,u)\int_X f(x,u,x')\rho(x,u,x')\,dx'\,du\,dx \tag{1}$$
where $d_\gamma^\pi(x)=\sum_{k=0}^{\infty}\gamma^k p(x_k=x\,|\,d_0,\pi)$ is the discounted state distribution under the policy $\pi$ [16], [19], and $\gamma\in[0,1)$ denotes the reward discount factor. Note that $p(x_k=x)$ is a probability density function here.
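To make the discounted return in (1) concrete, the following sketch estimates $J(\pi)$ by averaging truncated Monte Carlo rollouts; the environment interface (env_step), the policy, and the sampling of $d_0$ are hypothetical placeholders introduced only for this illustration.

```python
import numpy as np

def mc_discounted_return(env_step, policy, d0_sample, gamma=0.95,
                         n_rollouts=100, horizon=200):
    """Monte Carlo estimate of J(pi) = E{ sum_k gamma^k r_{k+1} | d0, pi }.

    Assumed interface: env_step(x, u) -> (x_next, r), policy(x) -> u,
    and d0_sample() draws an initial state from d_0.
    """
    returns = []
    for _ in range(n_rollouts):
        x, g, discount = d0_sample(), 0.0, 1.0
        for _ in range(horizon):       # truncate the infinite horizon
            u = policy(x)
            x, r = env_step(x, u)
            g += discount * r
            discount *= gamma
        returns.append(g)
    return np.mean(returns)
```

Averaging over many rollouts reduces, but does not remove, the variance of this estimate.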
During learning, the agent will have to estimate the cost-to-go function $J$ for a given policy $\pi$. This procedure is called policy evaluation. The resulting estimate of $J$ is called the value function and two definitions exist for it. The state value function
$$V^\pi(x)=E\left\{\sum_{k=0}^{\infty}\gamma^k r_{k+1}\,\Big|\,x_0=x,\pi\right\} \tag{2}$$
only depends on the state $x$ and assumes that the policy $\pi$ is followed starting from this state. The state-action value function
$$Q^\pi(x,u)=E\left\{\sum_{k=0}^{\infty}\gamma^k r_{k+1}\,\Big|\,x_0=x,u_0=u,\pi\right\} \tag{3}$$
also depends on the state $x$, but makes the action $u$ chosen in this state a free variable instead of having it generated by the policy $\pi$. Once the first transition onto a next state has been made, $\pi$ governs the rest of the action selection. The relationship between these two definitions for the value function is given by
$$V^\pi(x)=E\{Q^\pi(x,u)\,|\,u\sim\pi(x,\cdot)\}.$$
With some manipulation, (2) and (3) can be put into a recursive form [18]. For the state value function, this is
$$V^\pi(x)=E\{\rho(x,u,x')+\gamma V^\pi(x')\} \tag{4}$$
with $u$ drawn from the probability distribution function $\pi(x,\cdot)$ and $x'$ drawn from $f(x,u,\cdot)$. For the state-action value function, the recursive form is
$$Q^\pi(x,u)=E\{\rho(x,u,x')+\gamma Q^\pi(x',u')\} \tag{5}$$
with $x'$ drawn from the probability distribution function $f(x,u,\cdot)$ and $u'$ drawn from the distribution $\pi(x',\cdot)$. These recursive relationships are called Bellman equations [7].
Optimality for both the state value function $V^\pi$ and the state-action value function $Q^\pi$ is governed by the Bellman optimality equation. Denoting the optimal state value function with $V^*(x)$ and the optimal state-action value with $Q^*(x,u)$, the corresponding Bellman optimality equations for the discounted reward setting are
$$V^*(x)=\max_u E\{\rho(x,u,x')+\gamma V^*(x')\} \tag{6a}$$
$$Q^*(x,u)=E\{\rho(x,u,x')+\gamma \max_{u'} Q^*(x',u')\}. \tag{6b}$$
B. Average Reward
As an alternative to the discounted reward setting, there is also the approach of using the average return [18]. In this setting, a distribution $d_0$ does not need to be chosen, under the assumption that the process is ergodic [7] and thus that $J$ does not depend on the starting state. Instead, the value functions for a policy $\pi$ are defined relative to the average expected reward per time step under the policy, turning the cost-to-go function into
$$J(\pi)=\lim_{n\to\infty}\frac{1}{n}E\left\{\sum_{k=0}^{n-1} r_{k+1}\,\Big|\,\pi\right\}=\int_X d^\pi(x)\int_U \pi(x,u)\int_X f(x,u,x')\rho(x,u,x')\,dx'\,du\,dx. \tag{7}$$
Equation (7) is very similar to (1), except that the definition for the state distribution changes to $d^\pi(x)=\lim_{k\to\infty} p(x_k=x\,|\,\pi)$. For a given policy $\pi$, the state value function $V^\pi(x)$ and state-action value function $Q^\pi(x,u)$ are then defined as
$$V^\pi(x)=E\left\{\sum_{k=0}^{\infty}\bigl(r_{k+1}-J(\pi)\bigr)\,\Big|\,x_0=x,\pi\right\}$$
$$Q^\pi(x,u)=E\left\{\sum_{k=0}^{\infty}\bigl(r_{k+1}-J(\pi)\bigr)\,\Big|\,x_0=x,u_0=u,\pi\right\}.$$
The Bellman equations for the average reward—in this case also called the Poisson equations [20]—are
$$V^\pi(x)+J(\pi)=E\{\rho(x,u,x')+V^\pi(x')\} \tag{8}$$
with $u$ and $x'$ drawn from the appropriate distributions as before and
$$Q^\pi(x,u)+J(\pi)=E\{\rho(x,u,x')+Q^\pi(x',u')\} \tag{9}$$
again with $x'$ and $u'$ drawn from the appropriate distributions. Note that (8) and (9) both require the value $J(\pi)$, which is unknown and hence needs to be estimated in some way. The Bellman optimality equations, describing an optimum for the average reward case, are
$$V^*(x)+J^*=\max_u E\{\rho(x,u,x')+V^*(x')\} \tag{10a}$$
$$Q^*(x,u)+J^*=E\{\rho(x,u,x')+\max_{u'} Q^*(x',u')\} \tag{10b}$$
where $J^*$ is the optimal average reward as defined by (7) when an optimal policy $\pi^*$ is used.
SECTION III. Actor-Critic in the Context of Reinforcement Learning
As discussed in the introduction, the vast majority of RL methods can be divided into three groups [1]: critic-only, actor-only, and actor-critic methods. This section will give an explanation on all three groups, starting with critic-only methods. The part on actor-only methods introduces the concept of a policy gradient, which provides the basis for actor-critic algorithms. The final part of this section explains the policy gradient theorem, an important result that is now widely used in many implementations of actor-critic algorithms.
In real-life applications, such as robotics, processes usually have continuous state and action spaces, making it impossible to store exact value functions or policies for each separate state or state-action pair. Any RL algorithm used in practice will have to make use of function approximators for the value function and/or the policy in order to cover the full range of states and actions. Therefore, this section assumes the use of such function approximators.
A. Critic-Only Methods
Critic-only methods, such as Q-learning [21]–[23] and SARSA [24], use a state-action value function and no explicit function for the policy. For continuous state and action spaces, this will be an approximate state-action value function.
These methods learn the optimal value function by finding online an approximate solution to the Bellman equations (6b) or (10b). A deterministic policy, denoted by $\pi:X\mapsto U$, is calculated by using an optimization procedure over the value function
$$\pi(x)=\arg\max_u Q(x,u). \tag{11}$$
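As an illustration of the greedy-action computation in (11) when the action space is continuous, the sketch below enumerates a discretized action set, which is the workaround described above; the Q-function approximator and the one-dimensional action bounds are assumptions of this example.

```python
import numpy as np

def greedy_action(q_func, x, u_min=-1.0, u_max=1.0, n_points=21):
    """Approximate pi(x) = argmax_u Q(x, u) by enumerating a discretized
    action set, as critic-only methods typically do for continuous U.

    q_func(x, u) -> float is an assumed approximate state-action value
    function; u_min and u_max bound a one-dimensional action space.
    """
    candidates = np.linspace(u_min, u_max, n_points)
    values = [q_func(x, u) for u in candidates]
    return candidates[int(np.argmax(values))]
```

For multidimensional action spaces, the number of candidate actions grows exponentially with the dimension, which is the computational burden referred to above.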
There is no reliable guarantee on the near-optimality of the resulting policy for just any approximated value function when learning in an online setting. For example, Q-learning and SARSA with specific function approximators have been shown not to converge even for simple MDPs [25]–[27]. However, the counterexamples used to show divergence were further analyzed in [28] (with an extension to the stochastic setting in [29]), and it was shown that convergence can be assured for linear-in-parameters function approximators if trajectories are sampled according to their on-policy distribution. The work in [28] also provides a bound on the approximation error between the true value function and the approximation learned by online TD learning. An analysis of more approximate policy evaluation methods is provided in [30], mentioning conditions for convergence and bounds on the approximation error for each method. Nevertheless, for most choices of basis functions, an approximated value function learned by TD learning will be biased.
This is reflected by the state-of-the-art bounds on the least-squares temporal difference (LSTD) solution quality [31], which always include a term depending on the distance between the true value function and its projection on the approximation space. For a particularly bad choice of basis functions, this bias can grow very large.
B. Actor-Only Methods and the Policy Gradient
Policy gradient methods (see, for instance, the SRV [32] and Williams’ REINFORCE algorithms [33]) are principally actor-only and do not use any form of a stored value function. Instead, the majority of actor-only algorithms work with a parameterized family of policies and optimize the cost defined by (1) or (7) directly over the parameter space of the policy. Although not explicitly considered here, work on nonparametric policy gradients does exist (see, e.g., [34] and [35]). A major advantage of actor-only methods over critic-only methods is that they allow the policy to generate actions in the complete continuous action space.
A policy gradient method is generally obtained by parameterizing the policy $\pi$ by the parameter vector $\vartheta\in\mathbb{R}^p$. Considering that both (1) and (7) are functions of the parameterized policy $\pi_\vartheta$, they are in fact functions of $\vartheta$. Assuming that the parameterization is differentiable with respect to $\vartheta$, the gradient of the cost function with respect to $\vartheta$ is described by
$$\nabla_\vartheta J=\frac{\partial J}{\partial \pi_\vartheta}\frac{\partial \pi_\vartheta}{\partial \vartheta}. \tag{12}$$
Then, by using standard optimization techniques, a locally optimal solution of the cost $J$ can be found. The gradient $\nabla_\vartheta J$ is estimated per time step, and the parameters are then updated in the direction of this gradient. For example, a simple gradient ascent method would yield the policy gradient update equation
$$\vartheta_{k+1}=\vartheta_k+\alpha_{a,k}\nabla_\vartheta J_k \tag{13}$$
where $\alpha_{a,k}>0$ is a small enough learning rate for the actor, by which it is obtained that $J(\vartheta_{k+1})\geq J(\vartheta_k)$.
Several methods exist to estimate the gradient, e.g., by using infinitesimal perturbation analysis or likelihood-ratio methods [36], [37]. For a broader discussion on these methods, see [4] and [38]. Approaches to model-based gradient methods are given in [39]–[41] and in the more recent work of Deisenroth and Rasmussen [42].
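As a minimal sketch of a likelihood-ratio gradient estimate (in the spirit of REINFORCE [33], not a specific algorithm from the survey), the code below averages $\nabla_\vartheta\ln\pi_\vartheta(x,u)$ weighted by the discounted return for a Gaussian policy with mean $\vartheta^\top\phi(x)$; the feature map, exploration variance, and step size are illustrative assumptions.

```python
import numpy as np

def reinforce_update(vartheta, episodes, phi, alpha=0.01, sigma=0.5, gamma=0.95):
    """One likelihood-ratio (REINFORCE-style) policy gradient step.

    The policy is Gaussian: u ~ N(vartheta^T phi(x), sigma^2).
    `episodes` is a list of trajectories [(x_0, u_0, r_1), (x_1, u_1, r_2), ...].
    """
    grad = np.zeros_like(vartheta)
    for traj in episodes:
        # discounted return of the whole trajectory
        G = sum(gamma**k * r for k, (_, _, r) in enumerate(traj))
        for x, u, _ in traj:
            # gradient of log pi(u | x) for the Gaussian policy
            grad_log_pi = (u - vartheta @ phi(x)) / sigma**2 * phi(x)
            grad += grad_log_pi * G
    grad /= len(episodes)             # Monte Carlo average over episodes
    return vartheta + alpha * grad    # gradient ascent step as in (13)
```

Because the return of the whole trajectory multiplies every per-step gradient term, this estimator typically exhibits the high variance discussed above, which is exactly what the critic in actor-critic methods is introduced to reduce.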
The main advantage of actor-only methods is their strong convergence property, which is naturally inherited from gradient descent methods. Convergence is obtained if the estimated gradients are unbiased and the learning rates $\alpha_{a,k}$ satisfy [7], [38]
$$\sum_{k=0}^{\infty}\alpha_{a,k}=\infty,\qquad \sum_{k=0}^{\infty}\alpha_{a,k}^2<\infty.$$
A drawback of the actor-only approach is that the estimated gradient may have a large variance [19], [43]. In addition, every gradient is calculated without using any knowledge of past estimates [1], [44].
C. Actor-Critic Algorithms
Actor-critic methods [45], [46] aim to combine the advantages of actor-only and critic-only methods. Like actor-only methods, actor-critic methods are capable of producing continuous actions, while the large variance in the policy gradients of actor-only methods is countered by adding a critic. The role of the critic is to evaluate the current policy prescribed by the actor. In principle, this evaluation can be done by any policy evaluation method commonly used, such as TD [6], [18], LSTD [3], [18], [47], or residual gradients [25]. The critic approximates and updates the value function using samples. The value function is then used to update the actor’s policy parameters in the direction of performance improvement. These methods usually preserve the desirable convergence properties of policy gradient methods, in contrast with critic-only methods. In actor-critic methods, the policy is not directly inferred from the value function by using (11). Instead, the policy is updated in the policy gradient direction using only a small step size $\alpha_a$, meaning that a change in the value function will only result in a small change in the policy, leading to less or no oscillatory behavior in the policy, as described in [48].
Fig. 1 shows the schematic structure of an actor-critic algorithm. The learning agent has been split into two separate entities: the actor (policy) and the critic (value function). The actor is only responsible for generating a control input u, given the current state x. The critic is responsible for processing the rewards it receives, i.e., evaluating the quality of the current policy by adapting the value function estimate. After a number of policy evaluation steps by the critic, the actor is updated by using information from the critic.
Fig. 1. Schematic overview of an actor-critic algorithm. The dashed line indicates that the critic is responsible for updating the actor and itself.
A unified notation for the actor-critic algorithms described in this paper allows for an easier comparison between them. In addition, most algorithms can be fitted to a general template of standard update rules. Therefore, two actor-critic algorithm templates are introduced: one for the discounted reward setting and one for the average reward setting.
Once these templates are established, specific actor-critic algorithms can be discussed by only looking at how they fit into the general template or in what way they differ from it.
For both reward settings, the value function is parameterized by the parameter vector $\theta\in\mathbb{R}^q$. This will be denoted with $V_\theta(x)$ or $Q_\theta(x,u)$. If the parameterization is linear, the features (basis functions) will be denoted with $\phi$, i.e.,
$$V_\theta(x)=\theta^\top\phi(x)\quad\text{or}\quad Q_\theta(x,u)=\theta^\top\phi(x,u). \tag{14}$$
The stochastic policy $\pi$ is parameterized by $\vartheta\in\mathbb{R}^p$ and will be denoted with $\pi_\vartheta(x,u)$. If the policy is denoted with $\pi_\vartheta(x)$, it is deterministic and no longer represents a probability density function, but the direct mapping from states to actions $u=\pi_\vartheta(x)$.
The goal in actor-critic algorithms—or any other RL algorithm for that matter—is to find the best policy possible, given some stationary MDP. A prerequisite for this is that the critic is able to accurately evaluate a given policy.
In other words, the goal of the critic is to find an approximate solution to the Bellman equation for that policy. The difference between the right-hand and left-hand sides of the Bellman equation, whether it is the one for the discounted reward setting (4) or the average reward setting (8), is called the TD error and is used to update the critic. Using the function approximation for the critic and a transition sample $(x_k,u_k,r_{k+1},x_{k+1})$, the TD error is estimated as
$$\delta_k=r_{k+1}+\gamma V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k). \tag{15}$$
Perhaps the most standard way of updating the critic is to exploit this TD error for use in a gradient descent update [7]
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k\nabla_\theta V_{\theta_k}(x_k) \tag{16}$$
where $\alpha_{c,k}>0$ is the learning rate of the critic. For the linearly parameterized function approximator (14), this reduces to
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k\phi(x_k). \tag{17}$$
This TD method is also known as TD(0) learning, as no eligibility traces are used. The extension to the use of eligibility traces, resulting in TD($\lambda$) methods, is straightforward and is explained next.
Using (16) to update the critic results in a one-step backup, whereas the reward received is often the result of a series of steps. Eligibility traces offer a better way of assigning credit to states or state-action pairs visited several steps earlier. The eligibility trace vector for all $q$ features at time instant $k$ is denoted with $z_k\in\mathbb{R}^q$ and its update equation is [1], [7]
$$z_k=\lambda\gamma z_{k-1}+\nabla_\theta V_{\theta_k}(x_k).$$
It decays with time by a factor $\lambda\gamma$, with $\lambda\in[0,1)$ the trace decay rate. This makes the recently used features more eligible for receiving credit. The use of eligibility traces speeds up the learning considerably. Using the eligibility trace vector $z_k$, the update (16) of the critic becomes
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k z_k. \tag{18}$$

With the use of eligibility traces, the actor-critic template for the discounted return setting becomes
$$\delta_k=r_{k+1}+\gamma V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k) \tag{19a}$$
$$z_k=\lambda\gamma z_{k-1}+\nabla_\theta V_{\theta_k}(x_k) \tag{19b}$$
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k z_k \tag{19c}$$
$$\vartheta_{k+1}=\vartheta_k+\alpha_{a,k}\nabla_\vartheta J_k. \tag{19d}$$
Although not commonly seen, eligibility traces may be introduced for the actor as well. As with actor-only methods, several ways exist to estimate ∇ϑJk.
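The sketch below implements the discounted-return template (19) for a linear critic $V_\theta(x)=\theta^\top\phi(x)$ and a Gaussian actor, using $\delta_k\nabla_\vartheta\ln\pi_\vartheta(x_k,u_k)$ as the per-step sample of $\nabla_\vartheta J_k$ (one common choice, anticipating the policy gradient theorem of Section III-D); the environment interface, feature map, and constant learning rates are assumptions of the example.

```python
import numpy as np

def actor_critic_td_lambda(env, phi, n_features, n_steps=10000,
                           gamma=0.95, lam=0.7, alpha_c=0.05,
                           alpha_a=0.005, sigma=0.3):
    """Online actor-critic following template (19), discounted setting.

    Assumed interface: env.reset() -> x, env.step(u) -> (x_next, r),
    phi(x) -> feature vector of length n_features.
    Critic: V(x) = theta^T phi(x); actor mean: vartheta^T phi(x).
    """
    theta = np.zeros(n_features)      # critic parameters
    vartheta = np.zeros(n_features)   # actor parameters
    z = np.zeros(n_features)          # eligibility trace
    x = env.reset()
    for _ in range(n_steps):
        u = vartheta @ phi(x) + sigma * np.random.randn()            # exploration
        x_next, r = env.step(u)
        delta = r + gamma * (theta @ phi(x_next)) - theta @ phi(x)   # (19a)
        z = lam * gamma * z + phi(x)                                 # (19b)
        theta = theta + alpha_c * delta * z                          # (19c)
        # (19d): delta * grad log pi(u|x) used as the sample of grad J
        grad_log_pi = (u - vartheta @ phi(x)) / sigma**2 * phi(x)
        vartheta = vartheta + alpha_a * delta * grad_log_pi
        x = x_next
    return theta, vartheta
```

Other estimates of $\nabla_\vartheta J_k$ from the literature can be substituted in the actor update without changing the rest of the loop.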
For the average reward case, the critic can be updated using the average-cost TD method [49]. Then, Bellman equation (8) is considered, turning the TD error into
$$\delta_k=r_{k+1}-\hat J_k+V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k)$$
with $\hat J_k$ an estimate of the average cost at time $k$. Obviously, this requires an update equation for the estimate $\hat J$ as well, which usually is [1]
$$\hat J_k=\hat J_{k-1}+\alpha_{J,k}(r_{k+1}-\hat J_{k-1})$$
where $\alpha_{J,k}\in(0,1]$ is another learning rate. The critic still updates with (18). The update of the eligibility trace also needs to be adjusted, as the discount rate $\gamma$ is no longer present. The template for actor-critic algorithms in the average return setting then is
$$\hat J_k=\hat J_{k-1}+\alpha_{J,k}(r_{k+1}-\hat J_{k-1}) \tag{20a}$$
$$\delta_k=r_{k+1}-\hat J_k+V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k) \tag{20b}$$
$$z_k=\lambda z_{k-1}+\nabla_\theta V_{\theta_k}(x_k) \tag{20c}$$
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k z_k \tag{20d}$$
$$\vartheta_{k+1}=\vartheta_k+\alpha_{a,k}\nabla_\vartheta J_k. \tag{20e}$$
For the actor-critic algorithm to converge, it is necessary that the critic’s estimate is at least asymptotically accurate. This is the case if the step sizes $\alpha_{a,k}$ and $\alpha_{c,k}$ are deterministic, nonincreasing, and satisfy [1]
$$\sum_k\alpha_{a,k}=\infty,\qquad \sum_k\alpha_{a,k}^2<\infty,\qquad \sum_k\alpha_{c,k}=\infty \tag{21}$$
$$\sum_k\alpha_{c,k}^2<\infty,\qquad \sum_k\left(\frac{\alpha_{a,k}}{\alpha_{c,k}}\right)^d<\infty \tag{22}$$
for some $d\geq 0$. The learning rate $\alpha_{J,k}$ is usually set equal to $\alpha_{c,k}$. Note that such assumptions on learning rates are typical for all RL algorithms. They ensure that learning slows down but never stops and also that the update of the actor operates on a slower time-scale than the critic, to ensure that the critic has enough time to evaluate the current policy.
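For instance (an illustrative choice rather than one prescribed by the surveyed papers), the polynomial schedules
$$\alpha_{c,k}=\frac{c_c}{k^{0.6}},\qquad \alpha_{a,k}=\frac{c_a}{k^{0.9}},\qquad c_a,c_c>0$$
satisfy these conditions: both series of step sizes diverge while their squares sum to finite values, and $\sum_k(\alpha_{a,k}/\alpha_{c,k})^d=\sum_k (c_a/c_c)^d\,k^{-0.3d}<\infty$ for any $d>10/3$, so the actor indeed learns on a slower time-scale than the critic.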
Although TD(λ) learning is used quite commonly, other ways of determining the critic parameter θ do exist and some are even known to be superior in terms of convergence rate in both discounted and average reward settings [50], such as LSTD [3], [47]. LSTD uses samples collected along a trajectory generated by a policy π to set up a system of TD equations derived from or similar to (19a) or (20b). As LSTD requires an approximation of the value function which is linear in its parameters, i.e., Vθ(x)=θ⊤ϕ(x), this system is linear and can easily be solved for θ by a least-squares method. Regardless of how the critic approximates the value function, the actor update is always centered around (13), using some way to estimate ∇ϑJk.
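As a sketch of how an LSTD critic can replace the gradient-based TD update (18), the code below accumulates the least-squares system from on-policy transition samples and solves it for $\theta$ in one step (an LSTD(0) variant for the discounted setting); the sample format and the small regularization term are assumptions of this example.

```python
import numpy as np

def lstd_critic(samples, phi, n_features, gamma=0.95, reg=1e-6):
    """LSTD(0) policy evaluation for a linear critic V(x) = theta^T phi(x).

    `samples` is an iterable of on-policy transitions (x, r, x_next).
    The least-squares system A theta = b is built and solved in one go,
    instead of iterating the gradient-based TD update (18).
    """
    A = reg * np.eye(n_features)      # small regularization keeps A invertible
    b = np.zeros(n_features)
    for x, r, x_next in samples:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```

In an actor-critic scheme, the resulting $\theta$ would then take the place of the TD($\lambda$) estimate when updating the actor.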
For actor-critic algorithms, the question arises how the critic influences the gradient update of the actor. This is explained in the next section about the policy gradient theorem.
D. Policy Gradient Theorem
Many actor-critic algorithms now rely on the policy gradient theorem, a result obtained simultaneously in [1] and [19], proving that an unbiased estimate of the gradient (12) can be obtained from experience using an approximate value function satisfying certain properties. The basic idea, given by Konda and Tsitsiklis [1], is that since the number of parameters that the actor has to update is relatively small compared with the (usually infinite) number of states, it is not useful to have the critic attempting to compute the exact value function, which is also a high-dimensional object.
Instead, it should compute a projection of the value function onto a low-dimensional subspace spanned by a set of basis functions, which are completely determined by the parameterization of the actor.
In the case of an approximated stochastic policy, but exact state-action value function Qπ, the policy gradient theorem is as follows.
Theorem 1 (Policy Gradient): For any MDP, in either the average reward or discounted reward setting, the policy gradient is given by
$$\nabla_\vartheta J=\int_X d^\pi(x)\int_U \nabla_\vartheta\pi(x,u)Q^\pi(x,u)\,du\,dx$$
with $d^\pi(x)$ defined for the appropriate reward setting.
Proof: See [19].▪
This clearly shows the relationship between the policy gradient ∇ϑJ and the critic function Qπ(x,u) and ties together the update equations of the actor and critic in the templates (19) and (20).
For most applications, the state-action space is continuous and thus infinite, which means that it is necessary to approximate the state(-action) value function. The result in [1] and [19] shows that $Q^\pi(x,u)$ can be approximated with $h_w:X\times U\mapsto\mathbb{R}$, parameterized by $w$, without affecting the unbiasedness of the policy gradient estimate.
In order to find the closest approximation of $Q^\pi$ by $h_w$, one can try to find the $w$ that minimizes the quadratic error
$$E_w^\pi(x,u)=\frac{1}{2}\left[Q^\pi(x,u)-h_w(x,u)\right]^2.$$
The gradient of this quadratic error with respect to $w$ is
$$\nabla_w E_w^\pi(x,u)=\left[Q^\pi(x,u)-h_w(x,u)\right]\nabla_w h_w(x,u) \tag{23}$$
and this can be used in a gradient descent algorithm to find the optimal $w$. If the estimator of $Q^\pi(x,u)$ is unbiased, the expected value of (23) is zero for the optimal $w$, i.e.,
$$\int_X d^\pi(x)\int_U \pi(x,u)\nabla_w E_w^\pi(x,u)\,du\,dx=0. \tag{24}$$
The policy gradient theorem with function approximation is based on the equality in (24).
Theorem 2 (Policy Gradient with Function Approximation): If $h_w$ satisfies (24) and
$$\nabla_w h_w(x,u)=\nabla_\vartheta\ln\pi_\vartheta(x,u) \tag{25}$$
where $\pi_\vartheta(x,u)$ denotes the stochastic policy, parameterized by $\vartheta$, then
$$\nabla_\vartheta J=\int_X d^\pi(x)\int_U \nabla_\vartheta\pi(x,u)h_w(x,u)\,du\,dx. \tag{26}$$
Proof: See [19].▪
An extra assumption in [1] is that in (25), $h$ actually needs to be an approximator that is linear with respect to some parameter $w$ and features $\psi$, i.e., $h_w=w^\top\psi(x,u)$, transforming condition (25) into
$$\psi(x,u)=\nabla_\vartheta\ln\pi_\vartheta(x,u). \tag{27}$$
Features $\psi$ that satisfy (27) are known as compatible features [1], [19], [51]. In the remainder of the paper, these will always be denoted by $\psi$ and their corresponding parameters with $w$.
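As a worked example (a common choice of policy class, not tied to one particular surveyed algorithm), consider a Gaussian policy with linearly parameterized mean $\vartheta^\top\phi(x)$ and fixed standard deviation $\sigma$; condition (27) then yields the compatible features directly:
$$\pi_\vartheta(x,u)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\bigl(u-\vartheta^\top\phi(x)\bigr)^2}{2\sigma^2}\right)
\;\Longrightarrow\;
\psi(x,u)=\nabla_\vartheta\ln\pi_\vartheta(x,u)=\frac{u-\vartheta^\top\phi(x)}{\sigma^2}\,\phi(x).$$
The compatible approximation $h_w(x,u)=w^\top\psi(x,u)$ is linear in the exploration term $u-\vartheta^\top\phi(x)$ and therefore has zero mean under the policy, in line with the advantage-function interpretation discussed next.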
A technical issue, discussed in detail in [16] and [19], is that using the compatible function approximation $h_w=w^\top\nabla_\vartheta\ln\pi_\vartheta(x,u)$ gives
$$\int_U \pi_\vartheta(x,u)h_w(x,u)\,du=w^\top\nabla_\vartheta\underbrace{\int_U \pi_\vartheta(x,u)\,du}_{=1}=0.$$
This shows that the expected value of $h_w(x,u)$ under the policy $\pi_\vartheta$ is zero for each state, from which it can be concluded that $h_w$ is generally better thought of as the advantage function $A^\pi(x,u)=Q^\pi(x,u)-V^\pi(x)$. In essence, this means that using only compatible features for the value function results in an approximator that can only represent the relative value of an action $u$ in some state $x$ correctly, but not the absolute value $Q(x,u)$. An example showing how different the value function $Q(x,u)$ and the corresponding advantage function $A(x,u)$ can look is shown in Fig. 2. Because of this difference, the policy gradient estimate produced by just the compatible approximation will still have a high variance.
To lower the variance, extra features have to be added on top of the compatible features, which take the role of modeling the difference between the advantage function Aπ(x,u) and the state-action value function Qπ(x,u), i.e., the value function Vπ(x). These extra features are, therefore, only state-dependent, as dependence on the action would introduce a bias into the gradient estimate. The state-dependent offset that is created by these additional features is often referred to as a (reinforcement) baseline.
The policy gradient theorem actually generalizes to the case where a state-dependent baseline function is taken into account. Equation (26) would then read
$$\nabla_\vartheta J=\int_X d^\pi(x)\int_U \nabla_\vartheta\pi(x,u)\left[h_w(x,u)+b(x)\right]\,du\,dx \tag{28}$$
where $b(x)$ is the baseline function that can be chosen arbitrarily.
Adding a baseline will not affect the unbiasedness of the gradient estimate, but can improve the accuracy of the critic’s approximation and prevent an ill-conditioned projection of the value function on the compatible features $\psi$ [1]. In that respect, this paper treats $w$ as a subset of $\theta$ and $\psi$ as a subset of $\phi$. In practice, the optimal baseline, i.e., the baseline that minimizes the variance in the gradient estimate for the policy $\pi$, is the value function $V^\pi(x)$ [19], [20]. In [52], it is noted that, in light of the policy gradient theorem that was only published many years later, Gullapalli’s earlier idea in [32] of using the TD error $\delta$ in the gradient used to update the policy weights can be shown to yield the true policy gradient $\nabla_\vartheta J(\vartheta)$ and, hence, corresponds to the policy gradient theorem with respect to (28).
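The role of the TD error can be made explicit with a short, standard derivation: with the value function as baseline, the expected TD error from (15) equals the advantage function, so $\delta_k\psi(x_k,u_k)$ provides an unbiased sample of the policy gradient direction used in several of the algorithms below:
$$E\{\delta_k\,|\,x_k,u_k,\pi\}
=E\{r_{k+1}+\gamma V^\pi(x_{k+1})\,|\,x_k,u_k\}-V^\pi(x_k)
=Q^\pi(x_k,u_k)-V^\pi(x_k)=A^\pi(x_k,u_k).$$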
Fig. 2. Optimal value and advantage function for the example MDP in [16]. The system is $x_{k+1}=x_k+u_k$, using the optimal policy $\pi^*(x)=-Kx$ with $K$ the state feedback solution based on the reward function $r_k=-x_k^2-0.1u_k^2$. The advantage function nicely shows the zero contour line of the optimal action $u=-Kx$. (a) Value function $Q^*(x,u)$. (b) Advantage function $A^*(x,u)$.
Theorem 2 yields a major benefit. Once a good parameterization for a policy has been found, a parameterization for the value function automatically follows and also guarantees convergence. Further on in this paper, many actor-critic algorithms make use of this theorem.
Part of this paper is dedicated to giving some examples of relevant actor-critic algorithms in both the standard gradient and natural gradient setting.
As it is not possible to describe all existing actor-critic algorithms in this survey in detail, the algorithms addressed in this paper are chosen based on their originality: either they were the first to use a certain technique, extended an existing method significantly, or the containing paper provided an essential analysis. In Section II, a distinction between the discounted and average reward setting was already made. The reward setting is the first major axis along which the algorithms in this paper are organized.
The second major axis is the gradient type, which will be either the standard gradient or the natural gradient. This results in a total of four categories to which the algorithms can (uniquely) belong (see Table I). References in bold are discussed from an algorithmic perspective. Section IV describes actor-critic algorithms that use a standard gradient. Section V first introduces the concept of a natural gradient, after which natural actor-critic algorithms are discussed. References in italic are discussed in Section VI on applications.
Table I Actor-Critic Methods, Categorized Along Two Axes: Gradient Type and Reward Setting
SECTION IV. Standard Gradient Actor-Critic Algorithms
Many papers refer to Barto et al. [46] as the starting point of actor-critic algorithms, although there the actor and critic were called the associative search element and adaptive critic element, respectively. That paper itself mentions that the implemented strategy is closely related to [45] and [69]. Either way, it is true that Barto et al. [46] defined the actor-critic structure that resembles the recently proposed actor-critic algorithms the most. Therefore, the discussion on standard actor-critic algorithms starts with this work, after which other algorithms in the discounted return setting are discussed.
As many algorithms based on the average return also exist, they are dealt with in a separate section.
A. Discounted Return Setting
Barto et al. [46] use simple parameterizations
$$V_\theta(x)=\theta^\top\phi(x),\qquad \pi_\vartheta(x)=\vartheta^\top\phi(x)$$
with the same features $\phi(x)$ for the actor and critic. They chose binary features, i.e., for some state $x$ only one feature $\phi_i(x)$ has a nonzero value, in this case $\phi_i(x)=1$. For ease of notation, the state $x$ was taken to be a vector of zeros with only one element equal to 1, indicating the activated feature. This allowed the parameterization to be written as
$$V_\theta(x)=\theta^\top x,\qquad \pi_\vartheta(x)=\vartheta^\top x.$$
Then, they were able to learn a solution to the well-known cart-pole problem using the update equations
$$\delta_k=r_{k+1}+\gamma V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k) \tag{29a}$$
$$z_{c,k}=\lambda_c z_{c,k-1}+(1-\lambda_c)x_k \tag{29b}$$
$$z_{a,k}=\lambda_a z_{a,k-1}+(1-\lambda_a)u_k x_k \tag{29c}$$
$$\theta_{k+1}=\theta_k+\alpha_c\delta_k z_{c,k} \tag{29d}$$
$$\vartheta_{k+1}=\vartheta_k+\alpha_a\delta_k z_{a,k} \tag{29e}$$
with
$$u_k=\tau\bigl(\pi_{\vartheta_k}(x_k)+n_k\bigr)$$
where $\tau$ is a threshold, sigmoid, or identity function, $n_k$ is noise which accounts for exploration, and $z_c$, $z_a$ are eligibility traces for the critic and actor, respectively. Note that these update equations are similar to the ones in template (19), considering the representation in binary features. The product $\delta_k z_{a,k}$ in (29e) can then be interpreted as the gradient of the performance with respect to the policy parameter.
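A compact sketch of one update step of (29) for a one-hot ("box") state encoding is given below; the discretization, learning rates, trace decay rates, and noise level are illustrative assumptions rather than the original tuning used in [46].

```python
import numpy as np

def barto_actor_critic_step(theta, vartheta, zc, za, x, u, r, x_next,
                            gamma=0.95, lam_c=0.8, lam_a=0.8,
                            alpha_c=0.5, alpha_a=0.5):
    """One update of the Barto et al. actor-critic (29) with one-hot states.

    x and x_next are one-hot ('box') state vectors, u is the action that
    was applied in x, and r the resulting reward.
    """
    delta = r + gamma * (theta @ x_next) - theta @ x   # (29a)
    zc = lam_c * zc + (1 - lam_c) * x                  # (29b)
    za = lam_a * za + (1 - lam_a) * u * x              # (29c)
    theta = theta + alpha_c * delta * zc               # (29d)
    vartheta = vartheta + alpha_a * delta * za         # (29e)
    return theta, vartheta, zc, za

def select_action(vartheta, x, sigma=0.1):
    """Exploratory action u_k = tau(pi(x_k) + n_k), here with identity tau."""
    return float(vartheta @ x + sigma * np.random.randn())
```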
Although no use was made of advanced function approximation techniques, good results were obtained. A mere division of the state space into boxes meant that there was no generalization among the states, indicating that learning speeds could definitely be improved upon.
Nevertheless, the actor-critic structure itself remained and later work largely focused on better representations for the actor and the calculation of the critic.
Based on earlier work in [53], Wang et al. [54] introduced the fuzzy actor-critic reinforcement learning network (FACRLN), which uses only one fuzzy neural network based on radial basis functions for both the actor and the critic.
That is, they both use the same input and hidden layers, but differ in their output by using different weights. This is based on the idea that both actor and critic have the same input and also depend on the same system dynamics.
Apart from the regular updates to the actor and critic based on the TD error, the algorithm not only updates the parameters of the radial basis functions in the neural network, but also adaptively adds and merges fuzzy rules.
Whenever the TD error or the squared TD error is too high and the so-called ϵ-completeness property [70] is violated, a new rule, established by a new radial basis function, is added to the network. A closeness measure of the radial basis functions decides whether two (or more) rules should be merged into one.
For example, when using Gaussian functions in the network, if two rules have their centers and their widths close enough to each other, they will be merged into one. FACRLN is benchmarked against several other (fuzzy) actor-critic algorithms, including the original work in [46], and turns out to outperform them all in terms of the number of trials needed, without increasing the number of basis functions significantly.
At about the same time, Niedzwiedz et al. [55] also claimed, like with FACRLN, that there is redundancy in learning separate networks for the actor and critic and developed their consolidated actor-critic model (CACM) based on that same principle.
They too set up a single neural network, using sigmoid functions instead of fuzzy rules, and use it for both the actor and the critic. The biggest difference is that here, the size of the neural network is fixed, i.e., there is no adaptive insertion/removal of sigmoid functions.
More recently, work by Bhatnagar on the use of actor-critic algorithms using function approximation for discounted cost MDPs under multiple inequality constraints appeared in [56]. The constraints considered are bounds on the expected values of discounted sums of single-stage cost functions $\rho_n$, i.e.,
$$S_n(\pi)=\sum_{x\in X}d_0(x)W_n^\pi(x)\leq s_n,\qquad n=1,\ldots,N$$
with
$$W_n^\pi(x)=E\left\{\sum_{k=0}^{\infty}\gamma^k\rho_n(x_k,u_k)\,\Big|\,x_0=x,\pi\right\}$$
and $d_0$ a given initial distribution over the states. The approach is, as in usual constrained optimization problems, to extend the discounted cost function $J(\pi)$ to a Lagrangian cost function
$$L(\pi,\bar\mu)=J(\pi)+\sum_{n=1}^{N}\mu_n G_n(\pi)$$
where $\bar\mu=(\mu_1,\ldots,\mu_N)^\top$ is the vector of Lagrange multipliers, and $G_n(\pi)=S_n(\pi)-s_n$ are the functions representing the inequality constraints.
The algorithm generates an estimate of the policy gradient using simultaneous perturbation stochastic approximation (SPSA) [71], which has been found to be efficient even in high-dimensional parameter spaces. The SPSA gradient requires the introduction of two critics instead of one. The first critic, parameterized by $\theta^\top\phi(x)$, evaluates a policy parameterized by $\vartheta_k$. The second critic, parameterized by $\theta'^\top\phi(x)$, evaluates a slightly perturbed policy parameterized by $\vartheta_k+\epsilon\Delta_k$ with a small $\epsilon>0$. The element-wise policy parameter update is then given by
$$\vartheta_{i,k+1}=\Gamma_i\!\left(\vartheta_{i,k}+\alpha_{a}\sum_{x\in X}d_0(x)\,\frac{(\theta_k-\theta'_k)^\top\phi(x)}{\epsilon\Delta_{i,k}}\right)$$
where $\Gamma_i$ is a truncation operator. The Lagrange parameters $\mu$ also have an update rule of their own (further details in [56]), which introduces a third learning rate $\alpha_{L,k}$ into the algorithm, for which the regular conditions
$$\sum_k\alpha_{L,k}=\infty,\qquad \sum_k\alpha_{L,k}^2<\infty$$
must be satisfied and another constraint relating $\alpha_{L,k}$ to the actor step size $\alpha_{a,k}$
$$\lim_{k\to\infty}\frac{\alpha_{L,k}}{\alpha_{a,k}}=0$$
must also hold, indicating that the learning rate for the Lagrange multipliers should decrease quicker than the actor’s learning rate. Under these conditions, the authors prove the almost sure convergence to a locally optimal policy.
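The following sketch shows only the SPSA-style element-wise actor update described above; the two critics' parameter vectors are assumed to be produced by a separate policy-evaluation step, and the representative initial states, feature map, perturbation size, and truncation interval are assumptions of this illustration.

```python
import numpy as np

def spsa_actor_update(vartheta, theta, theta_pert, states, d0, phi,
                      perturb, eps=0.05, alpha_a=0.01, bound=10.0):
    """Element-wise SPSA-style actor update with truncation.

    theta, theta_pert: critic parameters evaluating the nominal policy
    (vartheta) and the perturbed policy (vartheta + eps * perturb).
    perturb: random perturbation vector, e.g., +/-1 Bernoulli entries.
    states, d0: representative initial states and their probabilities
    under d_0 (a finite-support assumption made for this sketch).
    """
    # value difference between the two critics, weighted by d_0
    diff = sum(p * ((theta - theta_pert) @ phi(x)) for x, p in zip(states, d0))
    new_vartheta = vartheta + alpha_a * diff / (eps * perturb)  # element-wise
    return np.clip(new_vartheta, -bound, bound)                 # truncation Gamma_i

# A typical perturbation draw:
# perturb = np.where(np.random.rand(vartheta.size) < 0.5, -1.0, 1.0)
```

Using a common random perturbation for all parameters is what allows SPSA to estimate the full gradient from only two policy evaluations.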
B. Average Reward Setting
In [1], together with the presentation of the novel ideas of compatible features, discussed in Section III-D, two actor-critic algorithms were introduced, differing only in the way they update the critic. The general update equations for these algorithms are
$$\hat J_k=\hat J_{k-1}+\alpha_{c,k}(r_{k+1}-\hat J_{k-1}) \tag{31a}$$
$$\delta_k=r_{k+1}-\hat J_k+Q_{\theta_k}(x_{k+1},u_{k+1})-Q_{\theta_k}(x_k,u_k) \tag{31b}$$
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k z_k \tag{31c}$$
$$\vartheta_{k+1}=\vartheta_k+\alpha_{a,k}\Gamma(\theta_k)Q_{\theta_k}(x_k,u_k)\psi(x_k,u_k) \tag{31d}$$
where $\psi$ is the vector of compatible features as defined in (27), and the parameterization $Q_\theta$ also contains these compatible features. The first and the second equation depict the standard update rules for the estimate of the average cost and the TD error. The third equation is the update of the critic. Here, the vector $z_k$ represents an eligibility trace [7] and it is exactly this that distinguishes the two different algorithms described in the paper. The first algorithm uses a TD(1) critic, basically taking an eligibility trace with decay rate $\lambda=1$. The eligibility trace is updated as
$$z_k=\begin{cases} z_{k-1}+\phi_k(x_k,u_k) & \text{if } x_k\neq x_S\\ \phi_k(x_k,u_k) & \text{otherwise}\end{cases}$$
where $x_S$ is a special reset state for which it is assumed that the probability of reaching it from any initial state $x$ within a finite number of transitions is bounded away from zero for any sequence of randomized stationary policies. Here, the eligibility trace is reset when encountering a state that meets this assumption. The second algorithm is a TD($\lambda$) critic, simply updating the eligibility trace as
$$z_k=\lambda z_{k-1}+\phi_k(x_k,u_k).$$
The update of the actor in (31d) uses the policy gradient estimate from Theorem 2. It leaves out the state distribution $d^\pi(x)$ seen earlier in (26) of the policy gradient theorem, as the expected value of $\nabla J(\vartheta_k)$ is equal to that of $\nabla_\vartheta\pi(x,u)\hat Q_w^\pi(x,u)$, and puts the critic’s current estimate in place of $\hat Q_w^\pi(x,u)$. Finally, $\Gamma(\theta_k)$ is a truncation term to control the step size of the actor, taking into account the current estimate of the critic. For this particular algorithm, some further assumptions on the truncation operator $\Gamma$ must hold, which are not listed here.
It is known that using least-squares TD methods for policy evaluation is superior to using regular TD methods in terms of convergence rate as they are more data efficient [3], [47]. Inevitably, this resulted in work on actor-critic methods using an LSTD critic [52], [72]. However, Paschalidis et al. [50] showed that it is not straightforward to use LSTD without modification, as it undermines the assumptions on the step sizes (21), (22). As a result of the basics of LSTD, the step size schedule for the critic should be chosen as $\alpha_{c,k}=\frac{1}{k}$. Plugging this demand into (21) and (22), two conditions on the step size of the actor conflict, i.e.,
$$\sum_k\alpha_{a,k}=\infty,\qquad \sum_k(k\,\alpha_{a,k})^d<\infty,\quad\text{for some } d>0.$$
They conflict because the first requires $\alpha_a$ to decay at a rate slower than $1/k$, while the second demands a rate faster than $1/k$. This means there is a tradeoff between the actor having too much influence on the critic and the actor decreasing its learning rate too fast. The approach presented in [50] to address this problem is to use the following step size schedule for the actor. For some $K\gg 1$, let $L=\lfloor k/K\rfloor$ and
$$\alpha_{a,k}:=\frac{1}{L+1}\hat\alpha_a(k+1-LK)$$
where $\sum_k(k\,\hat\alpha_a(k))^d<\infty$ for some $d>0$. As a possible example
$$\hat\alpha_a(b):=\varrho(C)\cdot b^{-C}$$
is provided, where $C>1$ and $\varrho(C)>0$. The critic’s step size schedule is redefined as
$$\alpha_{c,k}:=\frac{1}{k-\kappa(L,K)}.$$
Two extreme cases of $\kappa(L,K)$ are $\kappa(L,K)\triangleq 0$ and $\kappa(L,K)=LK-1$. The first alternative corresponds to the unmodified version of LSTD and the latter corresponds to "restarting" the LSTD procedure when $k$ is an integer multiple of $K$. The reason for adding the $\kappa$ term to the critic update is theoretical, as it may be used to increase the accuracy of the critic estimates for $k\to\infty$. Nevertheless, choosing $\kappa(L,K)=0$ gave good results in the simulations in [50]. These step size schedules for the actor and critic allow the critic to converge to the policy gradient, despite the intermediate actor updates, while constantly reviving the learning rate of the actor such that the policy updates do not stop prematurely.
The actor step size schedule does not meet the requirement $\sum_k(k\,\alpha_a)^d<\infty$ for some $d>0$, meaning that convergence of the critic for the entire horizon cannot be directly established. What is proven by the authors is that the critic converges before every time instant $k=JK$, at which point a new epoch starts. For the actor, the optimum is not reached during each epoch, but in the long run, it will move to an optimal policy. A detailed proof of this is provided in [50].
Berenji and Vengerov [5] used the analysis of [1] to provide a proof of convergence for an actor-critic fuzzy reinforcement learning (ACFRL) algorithm. The fuzzy element of the algorithm is the actor, which uses a parameterized fuzzy Takagi–Sugeno rulebase.
The authors show that this parameterization adheres to the assumptions needed for convergence stated in [1], hence providing the convergence proof. The update equations for the average cost and the critic are the same as (31a) and (31c), but the actor update is slightly changed into
$$\vartheta_{k+1}=\Gamma\bigl(\vartheta_k+\alpha_{a,k}\theta_k^\top\phi_k(x_k,u_k)\psi_k(x_k,u_k)\bigr)$$
i.e., the truncation operator $\Gamma$ is now acting on the complete update expression, instead of limiting the step size based on the critic’s parameter.
While applying ACFRL to a power management control problem, it was acknowledged that the highly stochastic nature of the problem and the presence of delayed rewards necessitated a slight adaptation to the original framework in [1]. The solution was to split the updating algorithm into three phases. Each phase consists of running a finite number of simulation traces. The first phase only estimates the average cost J^, keeping the actor and critic fixed. The second phase only updates the critic, based on the J^ obtained in the previous phase. This phase consists of a finite number of traces during which a fixed positive exploration term is used on top of the actor’s output and an equal number of traces during which a fixed negative exploration term is used. The claim is that this systematic way of exploring is very beneficial in problems with delayed rewards, as it allows the critic to better establish the effects of a certain direction of exploration.
The third and final phase keeps the critic fixed and lets the actor learn the new policy. Using this algorithm, ACFRL consistently converged to the same neighborhood of policy parameters for a given initial parameterization. Later, the authors extended the algorithm to ACFRL-2 in [66], which took the idea of systematic exploration one step further by learning two separate critics: one for positive exploration and one for negative exploration.
Bhatnagar et al. [20] introduced four algorithms. The first one uses a regular gradient and will, therefore, be discussed in this section. The update equations for this algorithm are
$$\hat J_k=\hat J_{k-1}+\alpha_{J,k}(r_{k+1}-\hat J_{k-1}) \tag{32a}$$
$$\delta_k=r_{k+1}-\hat J_k+V_{\theta_k}(x_{k+1})-V_{\theta_k}(x_k) \tag{32b}$$
$$\theta_{k+1}=\theta_k+\alpha_{c,k}\delta_k\phi(x_k) \tag{32c}$$
$$\vartheta_{k+1}=\Gamma\bigl(\vartheta_k+\alpha_{a,k}\delta_k\psi(x_k,u_k)\bigr). \tag{32d}$$
The critic update is simply an update in the direction of the gradient ∇θV. The actor update uses the fact that δkψ(xk,uk) is an unbiased estimate of ∇ϑJ under conditions mentioned in [20]. The operator Γ is a projection operator, ensuring boundedness of the actor update. Three more algorithms are discussed in [20], but these make use of a natural gradient for the updates and hence are discussed in Section V-C2.
SECTION V. Natural Gradient Actor-Critic Algorithms
The previous section introduced actor-critic algorithms which use standard gradients. The use of standard gradients comes with drawbacks.
Standard gradient descent is most useful for cost functions that have a single minimum and whose gradients are isotropic in magnitude with respect to any direction away from this minimum [73]. In practice, these two properties are almost never true. The existence of multiple local minima of the cost function, for example, is a known problem in RL, usually overcome by exploration strategies.
Furthermore, the performance of methods that use standard gradients relies heavily on the choice of a coordinate system over which the cost function is defined. This noncovariance is one of the most important drawbacks of standard gradients [51], [74]. An example for this will be given later in this section.
In robotics, it is common to have a curved state space (manifold) because of the presence of angles in the state. A cost function will then usually be defined in that curved space too, possibly causing inefficient policy gradient updates to occur. This is exactly what makes the natural gradient interesting, as it incorporates knowledge about the curvature of the space into the gradient. It is a metric based not on the choice of coordinates, but on the manifold that those coordinates parameterize [51].
This section is divided into two parts. The first part explains what the concept of a natural gradient is and what its effects are in a simple optimization problem, i.e., not considering a learning setting.
The second part is devoted to actor-critic algorithms that make use of this type of gradient to update the actor. As these policy updates are using natural gradients, these algorithms are also referred to as natural policy gradient algorithms.
A. Natural Gradient in Optimization
To introduce the notion of a natural gradient, this section summarizes work presented in [73]–[75]. Suppose a function J(ϑ) is parameterized by ϑ. When ϑ lives in a Euclidean space, the squared Euclidean norm of a small increment Δϑ is given by the inner product
\[
\|\Delta\vartheta\|_E^2 = \Delta\vartheta^\top\Delta\vartheta.
\]
A steepest descent direction is then defined by minimizing J(ϑ+Δϑ), while keeping ∥Δϑ∥_E equal to a small constant. When ϑ is transformed to other coordinates ϑ̃ in a non-Euclidean space, the squared norm of a small increment Δϑ̃ with respect to that Riemannian space is given by the product
\[
\|\Delta\tilde{\vartheta}\|_R^2 = \Delta\tilde{\vartheta}^\top G(\tilde{\vartheta})\,\Delta\tilde{\vartheta}
\]
where G(ϑ̃) is the Riemannian metric tensor, an n×n positive-definite matrix characterizing the intrinsic local curvature of a particular manifold in an n-dimensional space. The Riemannian metric tensor G(ϑ̃) can be determined from the relationship [73]:
\[
\|\Delta\vartheta\|_E^2 = \|\Delta\tilde{\vartheta}\|_R^2.
\]
Clearly, for Euclidean spaces, G(ϑ̃) is the identity matrix.
Standard gradient descent for the new parameters ϑ˜ would define the steepest descent with respect to the norm ∥Δϑ˜∥2=Δϑ˜⊤Δϑ˜. However, this would result in a different gradient direction, despite keeping the same cost function and only changing the coordinates. The natural gradient avoids this problem, and always points in the “right” direction, by taking into account the Riemannian structure of the parameterized space over which the cost function is defined. So now, J˜(ϑ˜+Δϑ˜) is minimized while keeping ∥Δϑ˜∥R small (J˜ here is just the original cost J, but written as a function of the new coordinates). This results in the natural gradient ∇˜ϑ˜J˜(ϑ˜) of the cost function, which is just a linear transformation of the standard gradient ∇ϑ˜J˜(ϑ˜) by the inverse of G(ϑ˜):
\[
\tilde{\nabla}_{\tilde{\vartheta}}\tilde{J}(\tilde{\vartheta}) = G^{-1}(\tilde{\vartheta})\,\nabla_{\tilde{\vartheta}}\tilde{J}(\tilde{\vartheta}).
\]
As an example of optimization with a standard gradient versus a natural gradient, consider a cost function based on polar coordinates
\[
J_P(r,\varphi) = \tfrac{1}{2}\left[(r\cos\varphi - 1)^2 + r^2\sin^2\varphi\right]. \tag{33}
\]
This cost function is equivalent to J_E(x,y) = ½[(x−1)² + y²], where x and y are Euclidean coordinates, i.e., the relationship between (r,φ) and (x,y) is given by
\[
x = r\cos\varphi, \qquad y = r\sin\varphi.
\]
Fig. 3(a) shows the contours and antigradients of JP(r,φ) for 0≤r≤3 and |φ|≤π, where
\[
-\nabla_{(r,\varphi)} J_P(r,\varphi) = -\begin{bmatrix} r - \cos\varphi \\ r\sin\varphi \end{bmatrix}.
\]
The magnitude of the gradient clearly varies widely over the (r,φ)-plane. When performing a steepest descent search on this cost function, the trajectories from any point (r,φ) to an optimal one will be far from straight paths. For the transformation of Euclidean coordinates into polar coordinates, the Riemannian metric tensor is [73]
\[
G(r,\varphi) = \begin{bmatrix} 1 & 0 \\ 0 & r^2 \end{bmatrix}
\]
so that the natural gradient of the cost function in (33) is
\[
-\tilde{\nabla}_{(r,\varphi)} J_P(r,\varphi) = -G(r,\varphi)^{-1}\,\nabla_{(r,\varphi)} J_P(r,\varphi) = -\begin{bmatrix} r - \cos\varphi \\[4pt] \dfrac{\sin\varphi}{r} \end{bmatrix}.
\]
Fig. 3(b) shows the natural gradients of JP(r,φ). Clearly, the magnitude of the gradient is now more uniform across the space, and the angles of the gradients also do not vary greatly away from the optimal point (1,0).
Fig. 3. (a) Standard and (b) natural gradients of the cost function JP(r,φ) in polar coordinates.
Fig. 4 shows the difference between a steepest descent method using a standard gradient and a natural gradient on the cost JP(r,φ) using a number of different initial conditions.
The natural gradient clearly performs better, as it always finds the optimal point, whereas the standard gradient generates paths that lead to points in the space which are not even feasible, because the radius needs to be positive.
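A small sketch of this comparison is given below. It assumes a starting point (r,φ)=(2,2), a fixed step size, and plain Euler-style updates, none of which are taken from the original experiment; it only shows how the two descent directions are formed and applied.

```python
import numpy as np

def grad_JP(r, phi):
    # Standard gradient of J_P(r, phi) = 0.5*((r*cos(phi) - 1)**2 + (r*sin(phi))**2)
    return np.array([r - np.cos(phi), r * np.sin(phi)])

def natural_grad_JP(r, phi):
    # Premultiply by G(r, phi)^{-1} = diag(1, 1/r^2): the phi component becomes sin(phi)/r
    return np.array([r - np.cos(phi), np.sin(phi) / r])

def steepest_descent(grad_fn, r, phi, step=0.05, iters=200):
    for _ in range(iters):
        g = grad_fn(r, phi)
        r, phi = r - step * g[0], phi - step * g[1]
    return r, phi

# Both runs should end near the optimum (r, phi) = (1, 0) from this particular
# starting point, but the paths they trace differ markedly (cf. Fig. 4).
print(steepest_descent(grad_JP, 2.0, 2.0))
print(steepest_descent(natural_grad_JP, 2.0, 2.0))
```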
Fig. 4. Trajectories for standard gradient (dashed) and natural gradient (solid) algorithms for minimizing JP(r,φ) in polar coordinates.
To get an intuitive understanding of what the effect of a natural gradient is, Fig. 5 shows trajectories for the standard and natural gradient that have been transformed onto the Euclidean space. Whatever the initial condition6 is, the natural gradient of JP(r,φ) always points straight to the optimum and follows the same path that the standard gradient of JE(x,y) would do.
Fig. 5. Trajectories for standard gradient (dashed) and natural gradient (solid) algorithms for minimizing JP(r,φ), transformed to Euclidean coordinates.
When J(ϑ) is a quadratic function of ϑ (as in many optimization problems, including, for example, those solved in supervised learning), the Hessian H(ϑ) is equal to G(ϑ) for the underlying parameter space, and there is no difference between using Newton’s method and natural gradient optimization. In general, however, natural gradient optimization differs from Newton’s method, e.g., G(ϑ) is always positive definite by construction, whereas the Hessian H(ϑ) may not be [73]. The general intuition developed in this section is essential before moving on to the natural policy gradient in MDPs, explained next.
B. Natural Policy Gradient
The possibility of using natural gradients in online learning was first appreciated in [75]. As shown above, the crucial property of the natural gradient is that it takes into account the structure of the manifold over which the cost function is defined, locally characterized by the Riemannian metric tensor. To apply this insight in the context of policy gradient methods, the main question is then what is an appropriate manifold, and once that is known, what is its Riemannian metric tensor.
Consider first just the parameterized stochastic policy πϑ(x,u) at a single state x, a probability distribution over the actions u. This policy is a point on a manifold of such probability distributions, found at coordinates ϑ. For a manifold of distributions, the Riemannian tensor is the so-called Fisher information matrix (FIM) [75], which for the policy above is [51]
\[
F(\vartheta,x) = \mathbb{E}\!\left[\nabla_\vartheta \ln\pi_\vartheta(x,u)\,\nabla_\vartheta \ln\pi_\vartheta(x,u)^\top\right] = \int_U \pi_\vartheta(x,u)\,\nabla_\vartheta \ln\pi_\vartheta(x,u)\,\nabla_\vartheta \ln\pi_\vartheta(x,u)^\top \,\mathrm{d}u. \tag{34}
\]
The single-state policy is directly related to the expected immediate reward over a single step from x. However, it does not tell much about the overall expected return J(π), which is defined over entire state trajectories. To obtain an appropriate overall FIM in the average reward case, Kakade [51] made the intuitive choice of taking the expectation of F(ϑ,x) with respect to the stationary state distribution dπ(x)
\[
F(\vartheta) = \int_X d^{\pi}(x)\,F(\vartheta,x)\,\mathrm{d}x. \tag{35}
\]
He was, however, unsure whether this was the right choice.
Later on, the authors of [52] and [74] independently showed that (35) is indeed a true FIM, for the manifold of probability distributions over trajectories in the MDP. When used to control the MDP with stochastic dynamics f, πϑ(x,u) gives rise to different controlled trajectories with different probabilities; therefore, each value of the parameter ϑ yields such a distribution over trajectories. To understand how this distribution is relevant to the value J(π) of the policy, observe that this value can be written as the expected value of the infinite-horizon return over all possible paths, where the expectation is taken with respect to precisely the trajectory distribution. Furthermore, in [52] and [74], the authors show that this idea also extends to the discounted reward case, where the FIM is still given by (35), only with dπ(x) replaced by the discounted state distribution dπγ(x), as defined in Section II-A.
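As an illustration of (34) and (35), the FIM can be estimated from on-policy samples. The sketch below assumes a Gaussian policy with mean ϑ⊤φ(x) and fixed standard deviation σ, a hypothetical parameterization chosen only because its score function ∇_ϑ ln π is available in closed form.

```python
import numpy as np

def score_gaussian(vartheta, phi_x, u, sigma):
    """Score function grad_vartheta ln pi_vartheta(x, u) for a Gaussian policy
    with mean vartheta @ phi(x) and fixed standard deviation sigma (assumed form)."""
    return (u - vartheta @ phi_x) * phi_x / sigma**2

def estimate_fim(vartheta, features, actions, sigma):
    """Sample-average estimate of F(vartheta) in (35): states are assumed to be
    drawn from d^pi and actions from pi_vartheta, so the empirical mean of the
    outer products of the scores approximates the expectation in (34)-(35)."""
    n = len(vartheta)
    F = np.zeros((n, n))
    for phi_x, u in zip(features, actions):
        s = score_gaussian(vartheta, phi_x, u, sigma)
        F += np.outer(s, s)
    return F / len(features)
```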
Examples of the difference in performance between regular policy gradients and natural policy gradients are provided in [51], [52], [74].
C. Natural Actor-Critic Algorithms
This section describes several representative actor-critic algorithms that employ a natural policy gradient. Again, a distinction is made between algorithms using the discounted return and the average return.
1) Discounted Return Setting
After the acknowledgement by Amari [75] that using the natural gradient could be beneficial for learning, the aptly named natural actor-critic algorithm in [52] was, to the best of our knowledge, the first actor-critic algorithm that successfully employed a natural gradient for the policy updates. Together with [51], they gave a proof that the natural gradient ∇̃_ϑJ(ϑ) is in fact the compatible feature parameter w of the approximated value function, i.e.,
\[
\tilde{\nabla}_\vartheta J(\vartheta) = w.
\]
Consequently, they were able to use a natural gradient without explicitly calculating the FIM. This turns the policy update step into
\begin{align}
\vartheta_{k+1} &= \vartheta_k + \alpha_a\,\tilde{\nabla}_\vartheta J(\vartheta) \tag{36a}\\
&= \vartheta_k + \alpha_a\,w_{k+1}. \tag{36b}
\end{align}
For the policy evaluation step of the algorithm, i.e., the calculation of the critic parameter w, LSTD-Q(λ) was used, which is their own extension of LSTD(λ) from [3]. The natural actor-critic outperformed policy gradient methods based on the standard gradient on a cart-pole balancing setup. Later, the work was extended in [16], where it was shown that several well-known reinforcement learning algorithms (e.g., Sutton and Barto’s actor-critic [7] and Bradtke’s Q-learning [23]) are strongly related to natural actor-critic algorithms. Furthermore, the paper presents the successful application of an episodic variant of the natural actor-critic (eNAC) on an anthropomorphic robot arm.
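A stripped-down sketch of the resulting update (36) is given below. It replaces the LSTD-Q(λ) critic of [52] with a plain regularized least-squares fit of advantage estimates onto the compatible features, which is only a simplified stand-in for illustration.

```python
import numpy as np

def natural_policy_update(vartheta, psi_list, adv_list, alpha_a, reg=1e-6):
    """Fit the compatible weights w by least squares and step along them, as in (36b).
    psi_list: compatible feature vectors psi(x_k, u_k); adv_list: advantage
    estimates for the same samples (e.g., TD errors)."""
    Psi = np.vstack(psi_list)                      # N x n matrix of compatible features
    a = np.asarray(adv_list)                       # N advantage estimates
    A = Psi.T @ Psi + reg * np.eye(Psi.shape[1])   # regularized normal equations
    w = np.linalg.solve(A, Psi.T @ a)              # critic: w approximates the natural gradient
    return vartheta + alpha_a * w                  # actor update (36b)
```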
For another example of a natural actor-critic algorithm with a regression-based critic, see [76].
Park et al. [60] extend the original work in [52] by using a recursive least-squares method in the critic, making the parameter estimation of the critic more efficient. They then successfully apply it to the control of a two-link robot arm.
Girgin and Preux [61] improve the performance of natural actor-critic algorithms by using a neural network for the actor, which includes a mechanism that automatically adds hidden layers to the network if the accuracy is not sufficient. Enhancing the eNAC method in [16] with this basis expansion method clearly showed its benefits on a cart-pole simulation.
Although many (natural) actor-critic algorithms use sophisticated function approximators, Kimura showed in [62] that a simple policy parameterization using rectangular coarse coding can still outperform conventional Q-learning in high-dimensional problems. In the simulations, however, Q-learning did outperform the natural actor-critic algorithm in low-dimensional problems.
2) Average Reward Setting
Bhatnagar et al. [20] introduced four algorithms, three of which are natural-gradient algorithms. They extend the results of [1] by using TD learning for the actor and by incorporating natural gradients. They also extend the work of [16] by providing the first convergence proofs and the first fully incremental natural actor-critic algorithms. The contribution of convergence proofs for natural actor-critic algorithms is important, especially since the algorithms utilize both function approximation and a bootstrapping critic, a combination which is essential to large-scale applications of RL.
The second algorithm only differs from the first algorithm, described at the end of Section IV-B with (32), in the actor update (32d). It directly substitutes the standard gradient with the natural gradient
\[
\vartheta_{k+1} = \Gamma\left(\vartheta_k + \alpha_{a,k}\,F_k^{-1}(\vartheta)\,\delta_k\,\psi(x_k,u_k)\right) \tag{37}
\]
where F is the FIM. This requires the actual calculation of the FIM. Since the FIM can be written using the compatible features ψ as
\[
F(\vartheta) = \int_X d^{\pi}(x)\int_U \pi(x,u)\,\psi(x,u)\,\psi^\top(x,u)\,\mathrm{d}u\,\mathrm{d}x
\]
sample averages can be used to compute it
\[
F_k(\vartheta) = \frac{1}{k+1}\sum_{l=0}^{k}\psi(x_l,u_l)\,\psi^\top(x_l,u_l).
\]
After converting this equation to a recursive update rule and putting the critic’s learning rate in place, the Sherman–Morrison matrix inversion lemma is used to obtain an iterative update rule for the inverse of the FIM7
\[
F_k^{-1}(\vartheta) = \frac{1}{1-\alpha_{c,k}}\left[F_{k-1}^{-1} - \alpha_{c,k}\,\frac{\left(F_{k-1}^{-1}\psi_k\right)\left(F_{k-1}^{-1}\psi_k\right)^\top}{1-\alpha_{c,k}\left(1-\psi_k^\top F_{k-1}^{-1}\psi_k\right)}\right]
\]
where the initial value F_0^{-1} is chosen to be a scalar multiple of the identity matrix. This update rule, together with the adjusted update of the actor, then forms the second algorithm.
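The recursion above translates directly into code. A minimal sketch is given below, written for the learning-rate form F_k = (1 − α_{c,k})F_{k−1} + α_{c,k}ψ_kψ_k⊤, with F_0^{-1} initialized by the caller, e.g., as a scaled identity matrix.

```python
import numpy as np

def fim_inverse_update(F_inv, psi, alpha_c):
    """Sherman-Morrison update of the inverse FIM for
    F_k = (1 - alpha_c) F_{k-1} + alpha_c * psi psi^T."""
    F_psi = F_inv @ psi
    denom = 1.0 - alpha_c * (1.0 - psi @ F_psi)
    return (F_inv - alpha_c * np.outer(F_psi, F_psi) / denom) / (1.0 - alpha_c)

# Example initialization (assumed): F_inv = 10.0 * np.eye(n)
```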
The third algorithm in [20] uses the fact that the compatible approximation w⊤ψ(x,u) is better thought of as an advantage function approximator instead of a state-action value function approximator, as mentioned in Section III-D. Hence, the algorithm tunes the weights w, such that the squared error Eπ(w)=E[(w⊤ψ(x,u)−Aπ(x,u))2] is minimized. The gradient of this error is
\[
\nabla_w E^{\pi}(w) = 2\sum_X d^{\pi}(x)\sum_U \pi(x,u)\left[w^\top\psi(x,u) - A^{\pi}(x,u)\right]\psi(x,u).
\]
As δ_k is an unbiased estimate of Aπ(x_k,u_k) (see [77]), the gradient is estimated with
\[
\widehat{\nabla_w E^{\pi}}(w) = 2\left(\psi_k\psi_k^\top w - \delta_k\psi_k\right) \tag{38}
\]
and the gradient descent update rule for w (using the same learning rate as the critic) is
\[
w_{k+1} = w_k - \alpha_{c,k}\left(\psi_k\psi_k^\top w_k - \delta_k\psi_k\right). \tag{39}
\]
Furthermore, the natural gradient estimate is given by w (as shown by Peters and Schaal [16]), so an explicit calculation of the FIM is no longer necessary. Therefore, the third algorithm is obtained by using (39) and replacing the actor update in (37) with
\[
\vartheta_{k+1} = \Gamma\left(\vartheta_k + \alpha_{a,k}\,w_{k+1}\right). \tag{40}
\]
The fourth algorithm in [20] is obtained by combining the second and third algorithms. The explicit calculation of F_k^{-1} is now used for the update of the compatible parameter w. The update of w now also follows its natural gradient, obtained by premultiplying the result in (38) with F_k^{-1}, i.e.,
\[
\widehat{\tilde{\nabla}_w E^{\pi}}(w) = 2F_k^{-1}\left(\psi_k\psi_k^\top w - \delta_k\psi_k\right)
\]
turning the update of w into
\begin{align*}
w_{k+1} &= w_k - \alpha_{c,k}\,F_k^{-1}\left(\psi_k\psi_k^\top w_k - \delta_k\psi_k\right)\\
&= w_k - \alpha_{c,k}\,F_k^{-1}\psi_k\psi_k^\top w_k + \alpha_{c,k}\,F_k^{-1}\delta_k\psi_k\\
&= w_k - \alpha_{c,k}\,w_k + \alpha_{c,k}\,F_k^{-1}\delta_k\psi_k
\end{align*}
where clever use is made of the fact that F_k is composed of the outer products ψψ⊤ of the compatible features. The actor update is still (40).
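Combined with the recursive inverse-FIM update sketched earlier, one step of this fourth algorithm can be written as follows; as before, the projection Γ is only illustrated by clipping, which is an assumption of the sketch.

```python
import numpy as np

def natural_td_actor_critic_step(w, vartheta, F_inv, delta, psi, alpha_c, alpha_a,
                                 proj=lambda v: np.clip(v, -10.0, 10.0)):
    """Sketch of the fourth algorithm of [20]: the compatible weights w follow
    their own natural gradient, which reduces to the moving average derived above,
    and the actor then steps along w as in (40)."""
    w = (1.0 - alpha_c) * w + alpha_c * (F_inv @ (delta * psi))
    vartheta = proj(vartheta + alpha_a * w)
    return w, vartheta
```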
Although most natural actor-critic algorithms use the natural gradient as defined in Section V, the generalized natural actor-critic (gNAC) algorithm in [68] does not. Instead, a generalized natural gradient (gNG) is used, which combines properties of the FIM and natural gradient as defined before with the properties of a differently defined FIM and natural gradient from the work in [78]. They consider the fact that the average reward J(ϑ) is not only affected by the policy π, but also by the resulting state distribution dπ(x) and define the FIM of the state-action joint distribution as
\[
F_{SA}(\vartheta) = F_S(\vartheta) + F_A(\vartheta) \tag{41}
\]
where F_S(ϑ) is the FIM of the stationary state distribution dπ(x) and F_A(ϑ) is the FIM as defined in (35). In [78], the use of F_SA(ϑ) as the FIM is considered better for learning than using the original FIM for three reasons:
Learning with FSA(ϑ) still benefits from the concepts of natural gradient, since it necessarily and sufficiently accounts for the probability distributions that the average reward depends on.
FSA(ϑ) is analogous to the Hessian matrix of the average reward.
Numerical experiments have shown a strong tendency to avoid plateaus in learning.
Nevertheless, the original FIM F_A(ϑ) accounts for the distribution over an infinite number of time steps, whereas F_SA(ϑ) only accounts for the distribution over a single time step. This might increase the mixing time of the Markov chain drastically, making it hard for the RL agent to estimate a gradient from only a few samples.
Therefore, the authors suggest using a weighted average of the two FIMs defined in (35) and (41), with a weighting factor ι. The gNG is then calculated by using the inverse of this weighted average, leading to the policy gradient
\[
\tilde{\nabla}_\vartheta J(\vartheta) = \left(\iota F_S + F_A\right)^{-1}\nabla_\vartheta J(\vartheta).
\]
The implementation of the algorithm is similar to that of NAC, with the slight difference that another algorithm, LSLSD [79], is used to estimate ∇_ϑ dπ(x). If ι=0, gNAC is equivalent to the original NAC algorithm of Peters et al. [52], but now optimizing the average return instead of the discounted return. In a numerical experiment with a randomly synthesized MDP with 30 states and 2 actions, gNAC with ι>0 outperformed the original NAC algorithm.
SECTION VI. Applications
This section provides references to papers that have applied actor-critic algorithms in several domains.
Note that the list of applications is not exhaustive and that other application domains for actor-critic algorithms, as well as more literature on the applications mentioned below, exist.
In the field of robotics, early successful results of using actor-critic type methods on real hardware were shown on a ball on a beam setup [80], a peg-in-hole insertion task [81], and biped locomotion [82]. Peters and Schaal showed in [16] that their natural actor-critic method was capable of getting an anthropomorphic robot arm to learn certain motor skills (see Fig. 6). Kim et al. [63] recently successfully applied a modified version of the algorithm in [60] to motor skill learning. Locomotion of a two-link robot arm was learned using a recursive least-squares natural actor-critic method in [60]. Another successful application on a real four-legged robot is given in [58]. Nakamura et al. devised an algorithm based on [16] which made a biped robot walk stably [64]. Underwater cable tracking [65] was done using the NAC method of [16], where it was used in both a simulation and real-time setting: Once the results from simulation were satisfactory, the policy was moved to an actual underwater vehicle, which continued learning during operation, improving the initial policy obtained from simulation.
Fig. 6. Episodic natural actor-critic method in [16] applied to an anthropomorphic robot arm performing a baseball bat swing task.
An example of a logistics problem solved by actor-critic methods is given in [50], which successfully applies such a method to the problem of dispatching forklifts in a warehouse. This is a high-dimensional problem because of the number of products, forklifts, and depots involved.
Even with over 200 million discrete states, the algorithm was able to converge to a solution that performed 20% better in terms of cost than a heuristic solution obtained by taking the exact solution of a smaller problem and expanding this to a large state space.
Usaha and Barria [67] use the algorithm from [1] described in Section IV-B, extended to handle semi-MDPs,8 for call admission control in low Earth orbit satellite networks. They compared the performance of this actor-critic semi-Markov decision algorithm (ACSMDP), together with an optimistic policy iteration (OPI) method, to an existing routing algorithm.
While both ACSMDP and OPI outperform the existing routing algorithm, ACSMDP has an advantage in terms of computational time, although OPI reaches the best result. Based on the FACRLN from [54] in Section IV-A, Li et al. [57] devised a way to control traffic signals at an intersection and showed in simulation that this method outperformed the commonly seen time slice allocation methods. Richter et al. [2] showed similar improvements in road traffic optimization when using natural actor-critic methods.
Finally, an application to the finance domain was described in [59], where older work on actor-critic algorithms [83] was applied in the problem of determining dynamic prices in an electronic retail market.
SECTION VII. Discussion and Outlook
When applying RL to a certain problem, knowing a priori whether a critic-only, actor-only, or actor-critic algorithm will yield the best control policy is virtually impossible. However, a few rules of thumb should help in selecting the most sensible class of algorithms to use.
The most important thing to consider first is the type of control policy that should be learned.
If it is necessary for the control policy to produce actions in a continuous space, critic-only algorithms are no longer an option, as calculating a control law would require solving the possibly non-convex optimization procedure of (11) over the continuous action space. Conversely, when the controller only needs to generate actions in a (small) countable, finite space, it makes sense to use critic-only methods, as (11) can be solved by enumeration. Using a critic-only method also overcomes the problem of high-variance gradients in actor-only methods and the introduction of more tuning parameters (e.g., extra learning rates) in actor-critic methods.
Choosing between actor-only and actor-critic methods is more straightforward. If the problem is modeled by a (quasi-)stationary MDP, actor-critic methods should provide policy gradients with lower variance than actor-only methods.
Actor-only methods are, however, more resilient to fast-changing nonstationary environments, in which a critic would be incapable of keeping up with the time-varying nature of the process and would not provide useful information to the actor, canceling the advantages of using actor-critic algorithms. In summary, actor-critic algorithms are most sensibly used in a (quasi-)stationary setting with a continuous state and action space.
Once the choice for an actor-critic method has been made, the issue of choosing the right features for the actor and critic, respectively, remains. There is consensus, though, that the features for the actor and critic do not have to be chosen independently.
Several actor-critic algorithms use the exact same set of features for both the actor and the critic, while the policy gradient theorem indicates that it is best to first choose a parameterization for the actor, after which compatible features for the critic can be derived. In this sense, the use of compatible features is beneficial, as it lessens the burden of choosing a separate parameterization for the value function. Note that compatible features do not completely eliminate the burden of choosing features for the value function (see Section III-D). Adding state-dependent features to the value function on top of the compatible features remains an important task, as this is the only way to further reduce the variance in the policy gradient estimates. How to choose these additional features remains a difficult problem.
Choosing a good parameterization for the policy in the first place also remains an important issue as it highly influences the performance after learning.
Choosing this parameterization does seem less difficult than for the value function, as in practice it is easier to get an idea of the shape of the policy than of the corresponding value function.
One of the conditions for successful application of RL in practice is that learning should be quick.
Although this paper focuses on gradient-based algorithms and how to estimate this gradient, it should be noted that it is not only the quality of the gradient estimate that influences the speed of learning.
Balancing the exploration and exploitation of a policy and choosing good learning rate schedules also have a large effect on this, although more recently expectation-maximization methods that work without learning rates have been proposed [84], [85]. With respect to gradient type, the natural gradient seems to be superior to the standard gradient. However, an example of standard Q-learning on low-dimensional problems in [62] and relative entropy policy search [44] showed better results than the natural gradient. Hence, even though the field of natural gradient actor-critic methods is still a very promising area for future research, it does not always show superior performance compared with other methods.
A number of applications which use natural gradients are mentioned in this paper.
The use of compatible features makes it straightforward to calculate approximations of natural gradients, which implies that any actor-critic algorithm developed in the future should attempt to use this type of gradient, as it speeds up learning without any real additional computational effort.
ACKNOWLEDGMENT
The authors are grateful for the very helpful comments and suggestions that were received during the reviewing process.
REFERENCES
[1] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms”, SIAM J. Control Optim., vol. 42, no. 4, pp. 1143-1166, 2003.
[2] S. Richter, D. Aberdeen and J. Yu, “Natural actor-critic for road traffic optimisation” in Advances in Neural Information Processing Systems 19, Cambridge, MA: MIT Press, pp. 1169-1176, 2007.
[3] J. A. Boyan, “Technical update: Least-squares temporal difference learning”, Mach. Learn., vol. 49, pp. 233-246, 2002.
[4] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation”, J. Artif. Intell. Res., vol. 15, pp. 319-350, 2001.
[5] H. R. Berenji and D. Vengerov, “A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters”, IEEE Trans. Fuzzy Syst., vol. 11, no. 4, pp. 478-485, Aug. 2003.
[6] R. S. Sutton, “Learning to predict by the methods of temporal differences”, Mach. Learn., vol. 3, pp. 9-44, 1988.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
[8] L. P. Kaelbling, M. L. Littman and A. W. Moore, “Reinforcement learning: A survey”, J. Artif. Intell. Res., vol. 4, pp. 237-285, 1996.
[9] A. Gosavi, “Reinforcement learning: A tutorial survey and recent advances”, INFORMS J. Comput., vol. 21, no. 2, pp. 178-192, 2009.
[10] C. Szepesvári, “Algorithms for reinforcement learning” in Synthesis Lectures on Artificial Intelligence and Machine Learning, New York: Morgan & Claypool, 2010.
[11] T. Hanselmann, L. Noakes and A. Zaknich, “Continuous-time adaptive critics”, IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 631-647, May 2007.
[12] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem”, Automatica, vol. 46, no. 5, pp. 878-888, 2010.
[13] P. Pennesi and I. C. Paschalidis, “A distributed actor-critic algorithm and applications to mobile sensor network coordination problems”, IEEE Trans. Autom. Control, vol. 55, no. 2, pp. 492-497, Feb. 2010.
[14] C. Li, M. Wang and Q. Yuan, “A multi-agent reinforcement learning using actor-critic methods”, Proc. 7th Int. Conf. Mach. Learn. Cybern., pp. 878-882, 2008.
[15] L. Buşoniu, R. Babuška, B. De Schutter and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, Boca Raton, FL: CRC Press, 2010.
[16] J. Peters and S. Schaal, “Natural actor-critic”, Neurocomputing, vol. 71, pp. 1180-1190, 2008.
[17] V. S. Borkar, “A sensitivity formula for risk-sensitive cost and the actor-critic algorithm”, Syst. Control Lett., vol. 44, no. 5, pp. 339-346, 2001.
[18] D. P. Bertsekas, Dynamic Programming and Optimal Control, Volume II, Nashua, NH: Athena Scientific, 2007.
[19] R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation” in Advances in Neural Information Processing Systems 12, Cambridge, MA: MIT Press, pp. 1057-1063, 2000.
[20] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh and M. Lee, “Natural actor-critic algorithms”, Automatica, vol. 45, no. 11, pp. 2471-2482, 2009.
[21] C. J. C. H. Watkins, Learning from delayed rewards, 1989.
[22] C. J. C. H. Watkins and P. Dayan, “Q-Learning”, Mach. Learn., vol. 8, pp. 279-292, 1992.
[23] S. J. Bradtke, B. E. Ydstie and A. G. Barto, “Adaptive linear quadratic control using policy iteration”, Proc. Amer. Control Conf., pp. 3475-3479, 1994.
[24] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems”, Tech. Rep. CUED/F-INFENG/TR 166, 1994.
[25] L. Baird, “Residual algorithms: Reinforcement learning with function approximation”, Proc. 12th Int. Conf. Mach. Learn., pp. 30-37, 1995.
[26] G. J. Gordon, “Stable function approximation in dynamic programming”, Proc. 12th Int. Conf. Mach. Learn., pp. 261-268, 1995.
[27] J. N. Tsitsiklis and B. Van Roy, “Feature-based methods for large scale dynamic programming”, Mach. Learn., vol. 22, pp. 59-94, 1996.
[28] J. N. Tsitsiklis and B. Van Roy, “An analysis of temporal-difference learning with function approximation”, IEEE Trans. Autom. Control, vol. 42, no. 5, pp. 674-690, May 1997.
[29] F. S. Melo, S. P. Meyn and M. I. Ribeiro, “An analysis of reinforcement learning with function approximation”, Proc. 25th Int. Conf. Mach. Learn., pp. 664-671, 2008.
[30] R. Schoknecht, “Optimality of reinforcement learning algorithms with linear function approximation” in Advances in Neural Information Processing Systems 15, Cambridge, MA: MIT Press, pp. 1555-1562, 2003.
[31] A. Lazaric, M. Ghavamzadeh and R. Munos, “Finite-sample analysis of LSTD”, Proc. 27th Int. Conf. Mach. Learn., pp. 615-622, 2010.
[32] V. Gullapalli, “A stochastic reinforcement learning algorithm for learning real-valued functions”, Neural Netw., vol. 3, no. 6, pp. 671-692, 1990.
[33] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning”, Mach. Learn., vol. 8, pp. 229-256, 1992.
[34] J. A. Bagnell and J. Schneider, “Policy search in kernel Hilbert space”, Carnegie Mellon Univ., Pittsburgh, PA, 2003.
[35] K. Kersting and K. Driessens, “Non-parametric policy gradients: A unified treatment of propositional and relational domains”, Proc. 25th Int. Conf. Mach. Learn., pp. 456-463, 2008.
[36] V. M. Aleksandrov, V. I. Sysoyev and V. V. Shemeneva, “Stochastic optimization”, Eng. Cybern., vol. 5, pp. 11-16, 1968.
[37] P. W. Glynn, “Likelihood ratio gradient estimation: An overview” in Proc. Winter Simul. Conf., Atlanta, GA: ACM Press, pp. 366-375, 1987.
[38] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients”, Neural Netw., vol. 21, no. 4, pp. 682-697, 2008.
[39] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming (ser. Modern Analytic and Computational Methods in Science and Mathematics, vol. 24), New York: American Elsevier, 1970.
[40] P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal Control (ser. Mathematics in Science and Engineering, vol. 65), New York: Academic, 1970.
[41] L. Hasdorff, Gradient Optimization and Nonlinear Control, New York: Wiley, 1976.
[42] M. P. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search”, Proc. 28th Int. Conf. Mach. Learn., pp. 465-472, 2011.
[43] M. Riedmiller, J. Peters and S. Schaal, “Evaluation of policy gradient methods and variants on the cart-pole benchmark”, Proc. IEEE Symp. Approx. Dyn. Programm. Reinforcement Learn., pp. 254-261, 2007.
[44] J. Peters, K. Mülling and Y. Altün, “Relative entropy policy search”, Proc. 24th AAAI Conf. Artif. Intell., pp. 1607-1612, 2010.
[45] I. H. Witten, “An adaptive optimal controller for discrete-time Markov environments”, Inf. Control, vol. 34, pp. 286-295, 1977.
[46] A. G. Barto, R. S. Sutton and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems”, IEEE Trans. Syst. Man Cybern., vol. 13, no. 5, pp. 834-846, Sep./Oct. 1983.
[47] S. J. Bradtke and A. G. Barto, “Linear least-squares algorithms for temporal difference learning”, Mach. Learn., vol. 22, pp. 33-57, 1996.
[48] L. Baird and A. Moore, “Gradient descent for general reinforcement learning” in Advances in Neural Information Processing Systems 11, Cambridge, MA: MIT Press, 1999.
[49] J. N. Tsitsiklis and B. Van Roy, “Average cost temporal-difference learning”, Automatica, vol. 35, no. 11, pp. 1799-1808, 1999.
[50] I. C. Paschalidis, K. Li and R. M. Estanjini, “An actor-critic method using least squares temporal difference learning”, Proc. Joint 48th IEEE Conf. Decis. Control/28th Chin. Control Conf., pp. 2564-2569, 2009.
[51] S. Kakade, “Natural policy gradient” in Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press, pp. 1531-1538, 2001.
[52] J. Peters, S. Vijayakumar and S. Schaal, “Reinforcement learning for humanoid robotics”, 3rd IEEE-RAS Int. Conf. Human. Robots, 2003.
[53] Y. Cheng, J. Yi and D. Zhao, “Application of actor-critic learning to adaptive state space construction”, Proc. 3rd Int. Conf. Mach. Learn. Cybern., pp. 2985-2990, 2004.
[54] X. Wang, Y. Cheng and J. Yi, “A fuzzy actor-critic reinforcement learning network”, Inf. Sci., vol. 177, pp. 3764-3781, 2007.
[55] C. Niedzwiedz, I. Elhanany, Z. Liu and S. Livingston, “A consolidated actor-critic model with function approximation for high-dimensional POMDPs”, Proc. AAAI Workshop Adv. POMDP Solvers, pp. 37-42, 2008.
[56] S. Bhatnagar, “An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes”, Syst. Control Lett., vol. 59, no. 12, pp. 760-766, 2010.
[57] C. Li, M. Wang, Z. Sun and Z. Zhang, “Urban traffic signal learning control using fuzzy actor-critic methods”, Proc. 5th Int. Conf. Natural Comput., pp. 368-372, 2009.
[58] H. Kimura, T. Yamashita and S. Kobayashi, “Reinforcement learning of walking behavior for a four-legged robot”, Proc. 40th IEEE Conf. Decis. Control, pp. 411-416, 2001.
[59] C. Raju, Y. Narahari and K. Ravikumar, “Reinforcement learning applications in dynamic pricing of retail markets”, Proc. IEEE Int. Conf. E-Commerce, pp. 339-346, 2003.
[60] J. Park, J. Kim and D. Kang, “An RLS-based natural actor-critic algorithm for locomotion of a two-linked robot arm” in Lecture Notes on Artificial Intelligence 3801, Berlin/Heidelberg, Germany: Springer-Verlag, pp. 65-72, 2005.
[61] S. Girgin and P. Preux, “Basis expansion in natural actor critic methods” in Lecture Notes in Artificial Intelligence 5323, Berlin, Germany: Springer-Verlag, pp. 110-123, 2008.
[62] H. Kimura, “Natural gradient actor-critic algorithms using random rectangular coarse coding”, Proc. SICE Annu. Conf., pp. 2027-2034, 2008.
[63] B. Kim, J. Park, S. Park and S. Kang, “Impedance learning for robotic contact tasks using natural actor-critic algorithm”, IEEE Trans. Syst. Man Cybern. B Cybern., vol. 40, no. 2, pp. 433-443, Apr. 2010.
[64] Y. Nakamura, T. Mori, M.-A. Sato and S. Ishii, “Reinforcement learning for a biped robot based on a CPG-actor-critic method”, Neural Netw., vol. 20, pp. 723-735, 2007.
[65] A. El-Fakdi and E. Galceran, “Two steps natural actor critic learning for underwater cable tracking”, Proc. IEEE Int. Conf. Robot. Autom., pp. 2267-2272, 2010.
[66] D. Vengerov, N. Bambos and H. R. Berenji, “A fuzzy reinforcement learning approach to power control in wireless transmitters”, IEEE Trans. Syst. Man Cybern. B Cybern., vol. 35, no. 4, pp. 768-778, Aug. 2005.
[67] W. Usaha and J. A. Barria, “Reinforcement learning for resource allocation in LEO satellite networks”, IEEE Trans. Syst. Man Cybern. B Cybern., vol. 37, no. 3, pp. 515-527, Jun. 2007.
[68] T. Morimura, E. Uchibe, J. Yoshimoto and K. Doya, “A generalized natural actor-critic algorithm” in Advances in Neural Information Processing Systems 22, Cambridge, MA: MIT Press, pp. 1312-1320, 2009.
[69] A. L. Samuel, “Some studies in machine learning using the game of checkers”, IBM J. Res. Dev., vol. 3, no. 3, pp. 211-229, 1959.
[70] C. C. Lee, “Fuzzy logic in control systems: Fuzzy logic controller—Part I”, IEEE Trans. Syst. Man Cybern., vol. 20, no. 2, pp. 404-418, Mar./Apr. 1990.
[71] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation”, IEEE Trans. Autom. Control, vol. 37, no. 3, pp. 332-341, Mar. 1992.
[72] J. L. Williams, J. W. Fisher and A. S. Willsky, “Importance sampling actor-critic algorithms”, Proc. Amer. Control Conf., pp. 1625-1630, 2006.
[73] S. Amari and S. C. Douglas, “Why natural gradient?”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 1213-1216, 1998.
[74] J. A. Bagnell and J. Schneider, “Covariant policy search”, Proc. 18th Int. Joint Conf. Artif. Intell., pp. 1019-1024, 2003.
[75] S. Amari, “Natural gradient works efficiently in learning”, Neural Comput., vol. 10, no. 2, pp. 251-276, 1998.
[76] F. S. Melo and M. Lopes, “Fitted natural actor-critic: A new algorithm for continuous state-action MDPs”, Proc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases, pp. 66-81, 2008.
[77] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh and M. Lee, “Incremental natural actor-critic algorithms”, Proc. Adv. Neural Inf. Process. Syst., pp. 105-112, 2008.
[78] T. Morimura, E. Uchibe, J. Yoshimoto and K. Doya, “A new natural policy gradient by stationary distribution metric” in Lecture Notes in Artificial Intelligence 5212, Berlin, Germany: Springer-Verlag, pp. 82-97, 2008.
[79] T. Morimura, E. Uchibe, J. Yoshimoto, J. Peters and K. Doya, “Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning”, Neural Comput., vol. 22, no. 2, pp. 342-376, 2010.
[80] H. Benbrahim, J. Doleac, J. A. Franklin and O. Selfridge, “Real-time learning: A ball on a beam”, Proc. Int. Joint Conf. Neural Netw., pp. 98-103, 1992.
[81] V. Gullapalli, “Learning control under extreme uncertainty” in Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 327-334, 1993.
[82] H. Benbrahim and J. A. Franklin, “Biped dynamic walking using reinforcement learning”, Robot. Auton. Syst., vol. 22, pp. 283-302, 1997.
[83] V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for Markov decision processes”, SIAM J. Control Optim., vol. 38, no. 1, pp. 94-123, 1999.
[84] J. Kober and J. Peters, “Policy search for motor primitives in robotics”, Mach. Learn., vol. 84, pp. 171-203, 2011.
[85] N. Vlassis, M. Toussaint, G. Kontes and S. Piperidis, “Learning model-free robot control by a Monte Carlo EM algorithm”, Auton. Robots, vol. 27, no. 2, pp. 123-130, 2009.