Pseudocode
1. Q-learning
An algorithm that produces a Q-table, which the agent uses to pick the best action to take in a given state.
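To make the table-based idea concrete, here is a minimal sketch of tabular Q-learning; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameter values are assumptions for illustration, not taken from the source.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))          # the Q-table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the Q-table
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # assumed env interface
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```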
2. Replay Buffer
- Purpose 1 of the replay buffer: DQN introduces a replay buffer to break the correlation between successive samples, so that the training data is (approximately) independent and identically distributed (a minimal sketch of such a buffer is given at the end of this section).
- Why this matters: training a neural network assumes the training data is i.i.d. If you update the current network with samples just collected under the current parameters, those sequential samples are strongly correlated and training becomes very unstable. With experience replay, each update draws a random mix of samples collected under parameters from different time steps, so the correlation between the samples in a batch is much weaker.
- A reinforcement learning algorithm improves its policy through continual interaction with an uncertain environment, which distinguishes it from supervised and unsupervised learning in an essential way: the collected data is not i.i.d. Most machine learning algorithms assume i.i.d. data, otherwise convergence becomes problematic; the data generated by the agent's interaction with the environment, however, is strongly correlated in time and cannot be fully decoupled at the data level, so the i.i.d. assumption is violated and training is unstable. Moreover, the agent's behaviour also changes the distribution of the data it will see next.
- Purpose 2 of the replay buffer (quoted from the English literature):
Our problem is that we give sequential samples (the inputs arrive in trajectory order: 1 → 2 → 3) from interactions with the environment to our neural network. And it tends to forget the previous experiences as it overwrites them with new experiences.
For instance, if we are in the first level and then the second (which is totally different), our agent can forget how to behave in the first level.
(See the figure below.)
Experience replay prevents the network from only learning about what it has immediately done.
[Figure: By learning how to play on the water level, our agent will forget how to behave on the first level.]
As a consequence, it can be more efficient to make use of previous experience, by learning with it multiple times.
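As a sketch of how such a buffer looks in code (the class name, capacity, and method names are illustrative assumptions, not from the source): it stores transitions as they arrive and returns uniformly random minibatches, which both weakens the temporal correlation between samples and lets the same experience be reused in many updates.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of trajectories
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```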
Notes for reproducing the algorithm
- Sample random minibatch of transitions $(s_t, a_t, r_t, s_{t+1})$ from replay memory
- The action is selected from the main (online) network's $Q$ values
- The maximum $Q$ value used in the TD target is taken from the Target Network
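Putting these notes together, here is a minimal PyTorch-style sketch of one DQN update (the objects and names `q_net`, `target_net`, `optimizer`, and `buffer` are assumptions for illustration, not from the source): the online network provides $Q(s_t, a_t)$ for the actions actually taken, while the bootstrap term $\max_{a'} Q(s_{t+1}, a')$ in the TD target comes from the Target Network.

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    s, a, r, s_next, done = buffer.sample(batch_size)
    s      = torch.as_tensor(np.array(s), dtype=torch.float32)
    a      = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
    r      = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(np.array(s_next), dtype=torch.float32)
    done   = torch.as_tensor(done, dtype=torch.float32)

    # Q(s_t, a_t) from the main (online) network, for the actions that were taken
    q_sa = q_net(s).gather(1, a).squeeze(1)

    # max_a' Q_target(s_{t+1}, a') from the Target Network (no gradient through it)
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * (1.0 - done) * max_q_next

    loss = F.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Every fixed number of steps the target network is typically refreshed with `target_net.load_state_dict(q_net.state_dict())`, which is what keeps the bootstrap target stable between updates.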