Table of Contents
- Lecture 7: Temporal-Difference Learning (TD)
- Lecture 8: Value Function Approximation
- 1. Motivating examples: curve fitting
- Lecture 9: Policy Gradient Methods
- Lecture 10: Actor-Critic Methods
- 1. The simplest actor-critic algorithm
Thanks to the uploader for the course videos:
【强化学习的数学原理】课程:从零开始到透彻理解(完结) (Mathematical Foundations of Reinforcement Learning: from scratch to a thorough understanding, completed series)
Outline of this article:
Lecture 7: Temporal-Difference Learning (TD)
1. Motivating example
2. TD learning of state values
TD learning often refers to a broad class of RL algorithms.
However, in this section TD learning specifically refers to a classic algorithm for estimating state values.
First, the formula is given:
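A sketch of the update in its standard TD(0) form, where $\alpha_t(s_t)$ is the learning rate:

$$
\begin{aligned}
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma\, v_t(s_{t+1})\big)\Big],\\
v_{t+1}(s) &= v_t(s), \quad \forall s \neq s_t.
\end{aligned}
$$

Here $\bar{v}_t \doteq r_{t+1} + \gamma\, v_t(s_{t+1})$ is the TD target and $\delta_t \doteq v_t(s_t) - \bar{v}_t$ is the TD error.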
- The TD error should be zero in the expectation sense. The TD error reflects the deficiency between the current estimate $v_t$ and the true state value $v_\pi$. Proof:
This error shows that the current estimate still deviates from the true state value, so the error can be used to further improve the current estimate.
- Why is $\bar{v}_t$ called the TD target?
- Other properties
- Understanding the TD algorithm
- The TD algorithm solves the Bellman equation without a model!
- The TD algorithm is derived from the Robbins-Monro (RM) algorithm (explained in the video).
- The TD algorithm is essentially doing policy evaluation.
- Algorithm comparison (TD vs. MC):
A few points to note:
- A continuing task is an episode that never terminates; in practice it can be regarded as a case with a very large number of steps.
- Online means that as soon as a reward is obtained and the agent moves to the next state, the sample can immediately be used for an update; MC, in contrast, is offline and must finish a whole episode before it can update.
3. TD learning of action values: Sarsa
The TD algorithm introduced in the last section can only estimate state values. In this section, we introduce Sarsa, an algorithm that can directly estimate action values. Note that after the action values are estimated, a policy improvement step is still required to obtain the optimal policy!
1) Algorithm formula
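A sketch of the Sarsa update in its standard form:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, q_t(s_{t+1}, a_{t+1})\big)\Big],
$$

with $q_{t+1}(s, a) = q_t(s, a)$ for all $(s, a) \neq (s_t, a_t)$.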
As can be seen, this algorithm simply replaces $v(s)$ in the TD algorithm of the previous section with $q(s_t, a_t)$.
2) Pseudocode
Here, updating the q-value is the policy evaluation step, and updating the policy is the policy improvement step.
Also, a single sample is used to update the q-value immediately, and the policy is updated right afterwards; this differs from the previously introduced algorithms, which compute the q-values accurately before improving the policy. A minimal code sketch is given below.
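A minimal Python sketch of the tabular Sarsa loop with an ε-greedy policy. The environment interface `env.reset()` / `env.step(a)` and the hyperparameter values are illustrative assumptions, not part of the original pseudocode.

```python
import numpy as np

def sarsa(env, num_states, num_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa: q-value update (policy evaluation) + epsilon-greedy update (policy improvement)."""
    q = np.zeros((num_states, num_actions))

    def epsilon_greedy(s):
        # The policy always reads the latest q, so improving q improves the policy.
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)     # one experience sample (s, a, r, s', a')
            a_next = epsilon_greedy(s_next)   # a_{t+1} is drawn from the same (behavior = target) policy
            td_target = r + gamma * q[s_next, a_next] * (not done)
            q[s, a] += alpha * (td_target - q[s, a])   # update with this single sample
            s, a = s_next, a_next
    return q
```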
3) Extension: TD learning of action values: Expected Sarsa
The expectation is taken directly over $s_{t+1}$ and $a_{t+1}$ (in the formula, the expectation is taken over $A$ first and then over $S$):
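A sketch of the Expected Sarsa TD target in its standard form:

$$
\bar{q}_t = r_{t+1} + \gamma\, \mathbb{E}\big[q_t(s_{t+1}, A)\big]
          = r_{t+1} + \gamma \sum_{a} \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a).
$$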
4. TD learning of action values: n-step Sarsa
It is a combination of Sarsa and MC.
(The iterative formula is obtained by applying the idea of stochastic approximation.)
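A sketch of the n-step TD target in its standard form; with $n = 1$ it reduces to Sarsa, and as $n \to \infty$ it approaches the MC return:

$$
q_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n}\, q_t(s_{t+n}, a_{t+n}).
$$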
5. TD learning of optimal action values: Q-learning
It should be noted that Sarsa can only estimate the action values of a given policy. It must be combined with a policy improvement
step to find optimal policies and hence their optimal action values. By contrast, Q-learning can directly estimate optimal action values.
1) Algorithm formula
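A sketch of the Q-learning update in its standard form:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \max_{a} q_t(s_{t+1}, a)\big)\Big].
$$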
From the expression of Q-learning, it can be seen that it is a stochastic approximation algorithm for solving the action values from the following equation:
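In its standard form (a sketch), that equation is

$$
q(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \,\Big|\, S_t = s,\, A_t = a\Big], \qquad \forall s, a.
$$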
This equation is actually the Bellman optimality equation expressed in terms of action values. (See the book for the proof.)
2)Off-policy vs on-policy
There exist two policies in a TD learning task: behavior policy and target policy.
- The behavior policy is used to generate experience samples.
- The target policy is constantly updated toward an optimal policy.
On-policy means the behavior policy and the target policy are the same: as soon as the policy produces an experience, that experience is used to update the policy itself. In off-policy learning the two policies can be different (they may also be the same): the behavior policy interacts with the environment to generate many experiences, which are then used to update the target policy!
Is each algorithm on-policy or off-policy?
- SARSA
It uses $\pi_t$ to obtain $a_{t+1}$ and then updates $q_{t+1}$, so the behavior policy and the target policy are the same; hence it is on-policy.
- MC
- Q-learning
As can be seen, all the variables in the TD target are already known and can be obtained without relying on any policy, whereas Sarsa still needs $a_{t+1}$, which must be sampled from the current policy; hence Q-learning is off-policy.
3) The Q-learning algorithm
- On-policy version
Very similar to Sarsa; only the q-value update step differs!
- Off-policy version
Note that here the target policy $\pi_T$ is updated with a greedy policy rather than an ε-greedy one. The behavior policy $\pi_b$ uses an ε-greedy policy; the smaller ε is, the weaker the exploration and the harder it is to converge to the optimal policy.
6. A unified viewpoint
All these algorithms can be expressed in a unified form:
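A sketch of that unified form:

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \bar{q}_t\big],
$$

where $\bar{q}_t$ is the TD target: $r_{t+1} + \gamma\, q_t(s_{t+1}, a_{t+1})$ for Sarsa, $r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$ for Expected Sarsa, $r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ for Q-learning, and the full discounted return for MC.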
Lecture 8: Value Function Approximation
1. Motivating examples: curve fitting
where $w$ is the parameter vector and $\phi(s)$ is the feature vector of $s$. It is notable that the approximation $\hat{v}(s, w) = \phi^T(s)\, w$ is linear in $w$ (a linear function of $w$!); in other words, changing $w$ changes the approximation of $v_\pi(s)$.
2. Foundation: the objective function
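The objective function, referred to as (8.2) in these notes, has the standard form (a sketch):

$$
J(w) = \mathbb{E}\big[\big(v_\pi(S) - \hat{v}(S, w)\big)^2\big].
$$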
Our goal is to find the best $w$ that minimizes $J(w)$. Since $S$ in (8.2) is a random variable, what is its probability distribution?
- First: the uniform distribution, i.e., setting the probability of each state to $1/|\mathcal{S}|$. This is not realistic.
- Second: the stationary distribution. In short, if the agent keeps following the current policy for many, many steps, the probability of visiting each state eventually stabilizes.
The stationary probabilities can be computed from the state transition matrix $P_\pi$ (they satisfy $d_\pi^T P_\pi = d_\pi^T$).
Notably, the smaller ε is (i.e., the closer the policy is to greedy), the closer the stationary probability of state s4 gets to 1 while the others approach 0.
3. Optimization algorithms
To minimize the objective function J(w), we can use the gradient-descent algorithm:
In the spirit of stochastic gradient descent, we can remove the expectation operation from (8.4) to obtain the following algorithm:
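A sketch of that update in its standard form:

$$
w_{t+1} = w_t + \alpha_t \big(v_\pi(s_t) - \hat{v}(s_t, w_t)\big)\, \nabla_w \hat{v}(s_t, w_t),
$$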
where $s_t$ is a sample of $S$. Since $v_\pi$ is involved, this cannot be used directly; we can replace $v_\pi(s_t)$ with an approximation so that the algorithm is implementable.
- First, Monte Carlo learning with function approximation:
$g_t$ is the discounted return computed starting from $s_t$ in the episode; then $g_t$ can be used as an approximation of $v_\pi(s_t)$.
- Second, TD learning with function approximation.
(See the book for the linear case in detail: the derivative is very easy to compute with this method because the approximation is a linear function of $w$.)
Both updates are sketched below.
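A sketch of the two updates in their standard forms:

$$
\begin{aligned}
\text{MC: }\; & w_{t+1} = w_t + \alpha_t \big(g_t - \hat{v}(s_t, w_t)\big)\, \nabla_w \hat{v}(s_t, w_t),\\
\text{TD: }\; & w_{t+1} = w_t + \alpha_t \big(r_{t+1} + \gamma\, \hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)\big)\, \nabla_w \hat{v}(s_t, w_t).
\end{aligned}
$$

In the linear case $\hat{v}(s, w) = \phi^T(s)\, w$, so $\nabla_w \hat{v}(s, w) = \phi(s)$.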
4. Illustrative examples
The figure above shows the true state values under the given policy (each action is taken with probability 0.2).
- The following shows the tabular TD algorithm. Each episode has 500 steps and starts from a randomly selected state-action pair following a uniform distribution:
- The TD-Linear algorithm. Consider the simplest approach and use a plane to fit the values, so
then we obtain:
Using the preceding algorithm, the final value of $w$ can be computed, thereby fitting the state values.
To improve the accuracy, we can also increase the dimension of the feature vector. To that end, we can consider:
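For example, following the grid-world example where a state is indexed by its coordinates $(x, y)$ (a sketch):

$$
\phi(s) = [1,\ x,\ y]^T \;\Rightarrow\; \hat{v}(s, w) = w_1 + w_2 x + w_3 y \quad (\text{a plane}),
$$

and a higher-order choice such as

$$
\phi(s) = [1,\ x,\ y,\ x^2,\ y^2,\ xy]^T.
$$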
5. Sarsa with function approximation
Replacing the function approximation of $v_\pi$ in the previous sections with an approximation of the action values gives the Sarsa algorithm with function approximation:
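A sketch of the update in its standard form:

$$
w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma\, \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t)\Big]\, \nabla_w \hat{q}(s_t, a_t, w_t).
$$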
Essentially, this step is still policy evaluation: obtaining the action values under π.
When linear functions are used, we have
6. Q-learning with function approximation
The update rule is:
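A sketch of the standard form:

$$
w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \max_{a} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t)\Big]\, \nabla_w \hat{q}(s_t, a_t, w_t).
$$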
Pseudocode:
- on-policy
- off-policy
7. Deep Q-learning
1) Algorithm description
Deep Q-learning aims to minimize the objective function:
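A sketch of the objective in its standard form, where $(S, A, R, S')$ are random variables:

$$
J(w) = \mathbb{E}\Big[\Big(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\Big)^2\Big].
$$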
This objective function can be viewed as the Bellman optimality error. That is because the equation whose error it measures, $q(s,a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\big]$, is the Bellman optimality equation expressed in terms of action values.
To make the gradient easy to compute, a target network $\hat{q}(s, a, w_T)$ is introduced to replace the $q$ inside the max; the other one is a main network representing $\hat{q}(s, a, w)$. In this way, $w_T$ is not considered when differentiating with respect to $w$. The objective function in this case degenerates to
2) Experience replay
Why does the DQN algorithm need experience replay?
- In reinforcement learning, the agent collects training samples by interacting with the environment. These samples are usually sequential, and the states and actions at adjacent time steps can be highly correlated. Training directly on such samples introduces strong correlation among the training data, which makes training unstable and makes it hard to converge to a good policy. One main purpose of experience replay is to break this correlation so that the training samples are more independent, improving training stability.
- The basic idea of experience replay is to store the samples collected from the interaction in a buffer and then draw random samples from this buffer during training. The benefits include reduced sample correlation, improved training stability, and better use of past experience.
- A benefit of random sampling is that each experience sample may be used multiple times, which can increase the data efficiency.
Why sample from a uniform distribution?
- In reinforcement learning tasks, the data generated by the agent-environment interaction form a sequential time series, and adjacent time steps are strongly correlated. Training on consecutive samples can make training unstable and hard to converge. Uniform random sampling mixes samples from different time steps, which breaks the temporal correlation and improves the stability of training.
- If we treat the state-action pair (S, A) as a single random variable instead of two, then the samples (S, A, R, S') no longer depend on the behavior policy $\pi_b$; that is, we no longer need to consider how $\pi_b$ selects the action A.
The specific algorithm procedure (a minimal code sketch is given after the notes below):
Answers:
- In the DQN (Deep Q-Network) algorithm, no policy is maintained explicitly; instead, the agent's behavior is influenced indirectly by updating the value function. DQN uses the idea of Q-learning to learn the optimal action-value function (Q function) and makes decisions based on the learned Q function.
- With a neural-network approximation, this is no longer necessary.
- Here the input of the neural network is (s, a); in the original paper the input is s and the output is the value of every action.
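Putting the pieces together (main network, target network, replay buffer with uniform sampling), a minimal PyTorch-style sketch of the training step is shown below. The network sizes, hyperparameters, and the contents of `replay_buffer` are illustrative assumptions, not the original pseudocode.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state to the q-values of all actions (as in the original DQN paper)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, s):
        return self.net(s)

state_dim, num_actions = 4, 2                        # illustrative sizes
main_net = QNet(state_dim, num_actions)              # q_hat(s, a, w)
target_net = QNet(state_dim, num_actions)            # q_hat(s, a, w_T)
target_net.load_state_dict(main_net.state_dict())    # w_T starts equal to w
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)                 # stores (s, a, r, s', done) tuples
gamma, batch_size = 0.99, 64

def train_step():
    if len(replay_buffer) < batch_size:
        return
    # Uniform random sampling from the buffer breaks temporal correlation.
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    a = a.long()

    # The TD target uses the frozen target network, so the gradient ignores w_T.
    with torch.no_grad():
        y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    loss = nn.functional.mse_loss(q, y)               # (y - q_hat(s, a, w))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Copy w into w_T every fixed number of steps."""
    target_net.load_state_dict(main_net.state_dict())
```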
8. Illustrative examples
Lecture 9: Policy Gradient Methods
1. Basic idea of policy gradient
- Previously, the action values under a policy were represented in tabular form; now the policy is represented by a function, so a policy π is fully determined by θ together with the function structure.
- The question now is how to update policies. In the tabular case the values could be modified directly; now the policy is improved by updating the parameter θ. It is then natural to use a policy gradient (PG) algorithm to update θ:
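A sketch of the gradient-ascent update in its standard form:

$$
\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta J(\theta_t),
$$

where $J(\theta)$ is a metric measuring how good the policy is.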
The next step is to find a metric that defines an optimal policy.
2. Metrics to define optimal policies
1) Average value
The first metric is the average state value, or simply the average value.
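Its standard definition (a sketch):

$$
\bar{v}_\pi = \sum_{s} d_\pi(s)\, v_\pi(s) = \mathbb{E}_{S \sim d_\pi}\big[v_\pi(S)\big],
$$

where $d_\pi(s)$ is a probability distribution over the states.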
There are two ways to choose $d_\pi(s)$: one is to treat all states as equally important and select $d_\pi(s) = 1/|\mathcal{S}|$ for every s; the other is to select $d_\pi(s)$ as the stationary distribution, which satisfies:
Another expression of $\bar{v}_\pi$, written as $J(\theta)$:
2) Average reward
The second metric is the average one-step reward, or simply the average reward.
The metric is defined as:
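Its standard definition (a sketch):

$$
\bar{r}_\pi = \sum_{s} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}\big[r_\pi(S)\big], \qquad
r_\pi(s) = \sum_{a} \pi(a \mid s)\, \mathbb{E}[R \mid s, a],
$$

where $d_\pi$ is the stationary distribution.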
The properties of this metric are discussed below:
Relationship between the two metrics: they are equivalent to each other. In the discounted case where γ < 1:
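Specifically (a sketch of the standard relationship):

$$
\bar{r}_\pi = (1 - \gamma)\, \bar{v}_\pi.
$$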
3. Gradients of the metrics
First, the general formulas are given:
Here, $\bar{v}_\pi^0$ denotes the case in which the distribution of $S$ is independent of π.
Here, $d_\pi$ denotes the stationary distribution, and $\rho_\pi$ is another distribution. The first equation holds approximately when 0 < γ < 1 and exactly when γ = 1.
1) Formula
First, the formula is given:
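A sketch of the formula in its standard expectation form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\ A \sim \pi(\cdot \mid S, \theta)}\big[\nabla_\theta \ln \pi(A \mid S, \theta)\, q_\pi(S, A)\big],
$$

where $\eta$ is a distribution over the states.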
In other words, it is written as an expectation so that it can be approximated by sampling.
2) Proof
Since an expression in terms of π is needed, the gradient of ln π is introduced:
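Namely (a sketch):

$$
\nabla_\theta \ln \pi(a \mid s, \theta) = \frac{\nabla_\theta\, \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}
\;\;\Longrightarrow\;\;
\nabla_\theta\, \pi(a \mid s, \theta) = \pi(a \mid s, \theta)\, \nabla_\theta \ln \pi(a \mid s, \theta).
$$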
Then the derivation:
Note that ln(x) requires x > 0, so π must also be strictly positive. How can this be ensured?
Use a neural network whose input is s and whose parameter is θ. The network has |A| outputs, each of which corresponds to π(a | s, θ) for an action a. The activation function of the output layer should be softmax. A minimal network sketch is given below.
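A minimal PyTorch sketch of such a policy network (the layer sizes are illustrative assumptions). The softmax output guarantees π(a | s, θ) > 0 for every action:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Input: state s; output: a probability for each of the |A| actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
            nn.Softmax(dim=-1),          # outputs are positive and sum to 1
        )

    def forward(self, s):
        return self.net(s)               # pi(a | s, theta) for all a

pi = PolicyNet(state_dim=4, num_actions=3)
probs = pi(torch.randn(1, 4))            # e.g. tensor([[0.2, 0.5, 0.3]])
action = torch.distributions.Categorical(probs).sample()
```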
Also, why are policy gradient algorithms all on-policy? From the gradient formula of J, the sampled action A must follow the distribution of π, so π is the behavior policy; but π is also exactly the policy to be improved, i.e., the target policy.
4. Policy gradient by Monte Carlo estimation
We can replace the expected value with a sample:
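A sketch of the resulting stochastic update, where $q_t(s_t, a_t)$ is an approximation of $q_\pi(s_t, a_t)$:

$$
\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t).
$$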
Notably, the policy gradient method is on-policy.
If $q_\pi(s_t, a_t)$ is approximated by Monte Carlo estimation, the algorithm is called REINFORCE [38] or Monte Carlo policy gradient. Consider the ln term of the formula above:
Analyzing it carefully:
In summary:
- First, a large $q_t$ means the current action is good, so it should be exploited more;
- Second, when $\pi_t(a_t \mid s_t)$ is small, the coefficient $\beta_t$ (which is inversely proportional to it) is large, which makes $\pi_{t+1}(a_t \mid s_t)$ a bit larger; this is exploration.
Note that after $\theta_{t+1}$ is obtained, it is not immediately used to generate new data. This is because the MC method used here is offline: the returns of the whole episode must be obtained before the policy can be updated. The TD-based methods introduced later can update immediately after obtaining $\theta_{t+1}$ and then generate new data.
Lecture 10: Actor-Critic Methods
1. The simplest actor-critic algorithm
In the above formula:
- If $q_\pi(s_t, a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE.
- If $q_\pi(s_t, a_t)$ is estimated by TD learning, the corresponding algorithms are usually called actor-critic.
If Sarsa is used as the TD algorithm, we obtain the following procedure (QAC):
Here, "stochastic" means that the PG algorithm already outputs actions through a softmax, so the policy already has exploration ability.
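A sketch of the two updates in that procedure (critic: Sarsa with function approximation; actor: policy gradient):

$$
\begin{aligned}
\text{Critic: }\; & w_{t+1} = w_t + \alpha_w \big[r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]\, \nabla_w q(s_t, a_t, w_t),\\
\text{Actor: }\; & \theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q(s_t, a_t, w_{t+1}).
\end{aligned}
$$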
2. A2C
1) Baseline invariance
It is invariant to an additional baseline. That is,
- First, it is still valid after introducing b(S), because (see the proof in the book):
- Second, why is the baseline useful?
It can reduce the approximation variance. Let the true gradient be $\mathbb{E}[X(S, A)]$; we can then examine $\mathrm{var}(X)$. The expectation $\mathbb{E}[X]$ is independent of b, but the variance is not. A smaller variance means that a single sample is more likely to be close to $\mathbb{E}[X]$, so it is worth finding the optimal b(S). Our goal is to design a good baseline that minimizes $\mathrm{var}(X)$:
However, the optimal baseline is not easy to compute, so there is a more convenient choice:
2) Algorithm description
Deriving the algorithm:
If $\delta_\pi(s, a) > 0$, it means that the corresponding action has a greater value than the mean value.
Replace the expectation with a stochastic sample:
In the A2C implementation, the advantage function is approximated by the TD error:
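That is (a sketch):

$$
q_t(s_t, a_t) - v_t(s_t) \;\approx\; \delta_t = r_{t+1} + \gamma\, v(s_{t+1}, w_t) - v(s_t, w_t).
$$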
Now we only need one network to approximate v(s).
Finally, the whole pipeline is:
Analysis of the algorithm:
- The policy π(θ_t) is stochastic and hence exploratory, so ε-greedy is not needed.
- A2C is on-policy.
3. Off-policy actor-critic
Why are REINFORCE, QAC and A2C all on-policy?
The sampled action A must follow the distribution of π; π is the behavior policy as well as the target policy that needs to be updated.
1) Importance sampling
In short: use samples drawn from distribution p1 to estimate an expectation under distribution p0.
The relationship between the two distributions:
Looking at the importance weight: when p0(xi) is large and p1(xi) is small, the sample xi is rarely drawn under p1 yet important under p0, so its weight should be large.
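Concretely, the estimator has the standard form (a sketch): draw $x_i \sim p_1$ and compute

$$
\mathbb{E}_{X \sim p_0}[X] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \frac{p_0(x_i)}{p_1(x_i)}\, x_i,
$$

where $p_0(x_i)/p_1(x_i)$ is the importance weight.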
(1) An illustrative example
The distribution p1: p(1) = 0.8, p(−1) = 0.2, so $\mathbb{E}_{X \sim p_1}[X] = 0.6$; but what we need is $\mathbb{E}_{X \sim p_0}[X] = 0$.
So what does this have to do with the algorithm?
We use the importance sampling technique to make A2C off-policy.
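A small Python check of the example above, assuming $p_0$ is the uniform distribution over $\{+1, -1\}$ (consistent with $\mathbb{E}_{X \sim p_0}[X] = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = {1: 0.8, -1: 0.2}        # behavior distribution: samples are drawn from p1
p0 = {1: 0.5, -1: 0.5}        # target distribution (assumed uniform over {+1, -1})

x = rng.choice([1, -1], size=100_000, p=[p1[1], p1[-1]])
plain_mean = x.mean()                                    # estimates E_{p1}[X] ~ 0.6
weights = np.where(x == 1, p0[1] / p1[1], p0[-1] / p1[-1])
is_mean = (weights * x).mean()                           # estimates E_{p0}[X] ~ 0.0

print(plain_mean, is_mean)
```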
2) The off-policy policy gradient theorem
Suppose that β is a behavior policy. Our goal is to use the samples generated by β to learn a target policy π that can maximize the following metric:
where $d_\beta$ is the stationary distribution under policy β.
This gives the off-policy policy gradient theorem:
3) Algorithm description
Apply the baseline b(s) to the theorem above:
To reduce the estimation variance, we can select the baseline as $b(S) = v_\pi(S)$, and the advantage function $q_t(s, a) - v_t(s)$ can be replaced by the TD error. That is,
Then, the algorithm becomes:
It can also be converted to:
From this equation, if the numerator is large, the probability of choosing this action next time will be higher, which is adequate exploitation. However, unlike before, the denominator now comes from the behavior policy β rather than the policy being updated.
Finally, we obtain:
4. Deterministic actor-critic
Up to now, the policies used in the policy gradient methods are all stochastic since it is required that π(a|s,θ) > 0 for every (s,a). It is important to study the deterministic case since it is naturally off-policy and can effectively handle continuous action spaces.
Now we use $a = \mu(s, \theta)$ to denote a deterministic policy; it is a mapping from $\mathcal{S}$ to $\mathcal{A}$. This deterministic policy can be represented by, for example, a neural network with s as its input, a as its output, and θ as its parameter. We often write µ(s, θ) as µ(s) for short.
Deterministic policy gradient:
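A sketch of its standard form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta}\Big[\nabla_\theta\, \mu(S, \theta)\, \big(\nabla_a q_\mu(S, a)\big)\big|_{a = \mu(S)}\Big],
$$

where $\eta$ is a distribution over the states.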
From this equation, the gradient in the deterministic case does not involve the action random variable A. As a result, when we use samples to approximate the true gradient, it is not required to sample actions; that is also why DPG is off-policy.
So let’s apply the gradient-ascent algorithm to maximize J(θ):
It should be noted that this algorithm is off-policy since the behavior policy β may be different from µ.
How to select the behavior policy β? It can be any exploratory policy. It can also be a stochastic policy obtained by adding noise to µ.