Udacity DRL DQN

A. Key Points of the DQN Paper
  1. Pre-processing: convert the raw input frames into square images so that training can be accelerated on the GPU.
  2. Frame Stacking: stack four consecutive frames into a single 84 * 84 * 4 input, so the DQN can also capture temporal correlation.
  3. Frame-Skipping Technique: during training, an action is selected only once every 4 frames.
  4. Experience Replay:
    • data efficient
    • breaks the correlation between consecutive states, reducing overfitting of the Q function.
    • under a given policy (or a narrow range of policy parameters), the collected transition tuples are concentrated and unevenly distributed; sampling tuples at random from the buffer evens out the distribution and reduces overfitting.
  5. Fixed Q Target: keeps the target from being differentiated through (i.e., from moving); otherwise training cannot converge.
  6. Off-policy method: the transition tuples used to update the policy are not sampled by the current policy.
  7. Reward Clipping
  8. Error Clipping
  9. Target Network Soft Update
B. Advanced DQN
  1. Double Q-Learning
  2. Dueling DQN
  3. Multi-step Return
  4. Prioritized Replay
  5. Noisy Net
  6. Distributional Q-Learning
  7. Rainbow

II. Key Points of the DQN Paper, Expanded

3. Frame-Skipping Technique
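
The notes leave this section empty. As an illustrative sketch only (the single-frame `env` interface is an assumption, not the DeepMind code), the wrapper below repeats each chosen action for 4 emulator frames and keeps a rolling stack of the last 4 processed 84 × 84 frames, combining points 2 and 3 from the summary above.

```python
from collections import deque
import numpy as np

class SkipAndStack:
    """Illustrative wrapper: act every `skip` frames, stack the last `stack` frames."""
    def __init__(self, env, skip=4, stack=4):
        self.env = env                 # assumed to return 84x84 grayscale frames
        self.skip = skip
        self.frames = deque(maxlen=stack)

    def reset(self):
        frame = self.env.reset()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)    # shape (84, 84, 4)

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.skip):                # frame skipping: repeat the action
            frame, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(frame)                 # frame stacking: keep the newest frame
        return np.stack(self.frames, axis=-1), total_reward, done
```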

4. Experience Replay

Advantages:

  1. data efficient
  2. breaks the correlation between consecutive states, reducing overfitting of the Q function.
  3. under a given policy (or a narrow range of policy parameters), the collected transition tuples are concentrated and unevenly distributed; sampling tuples at random from the buffer evens out the distribution and reduces overfitting.
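
As a minimal sketch of this mechanism (the capacity and names are illustrative): transitions are appended as they are generated and later drawn uniformly at random, which breaks the ordering correlation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transition tuples and sample them uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```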

5. Fixed Q Target

Experience replay helps us address one type of correlation: that between consecutive experience tuples. There is another kind of correlation that Q-learning is susceptible to.

Q-learning is a form of Temporal Difference (TD) learning. The TD target here is supposed to be a replacement for the true Q function (Q pi), which is unknown. We originally used Q pi to define a squared-error loss and differentiated it with respect to w to get our gradient-descent update rule. Q pi does not depend on our function approximator or its parameters, which results in a simple derivative and update rule. But our TD target does depend on these parameters, which means that simply replacing Q pi with such a target is mathematically incorrect.
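
In symbols (a standard rendering of what this paragraph describes, with w the network parameters and alpha the learning rate; the notation is assumed, not quoted from the lesson):

```latex
% Squared-error loss with the true (unknown) Q-function, and the update rule it yields
L(w) = \mathbb{E}\left[\left(Q_\pi(s,a) - \hat{Q}(s,a;w)\right)^2\right],
\qquad
\Delta w = \alpha \left(Q_\pi(s,a) - \hat{Q}(s,a;w)\right) \nabla_w \hat{Q}(s,a;w)

% The TD target used in its place depends on w itself, which is what makes the target "move":
Q_\pi(s,a) \;\approx\; r + \gamma \max_{a'} \hat{Q}(s',a';w)
```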

  • Moving Target: it can affect learning significantly when we use function approximation, where all the Q-values are intrinsically tied together through the function parameters. (It does no harm with a Q-table representation, since every Q-value is stored separately.)

You may be thinking, "Doesn't experience replay take care of this problem?" Well, it addresses a similar but slightly different issue. There we broke the correlation between consecutive experience tuples by sampling them randomly, out of order. Here, the correlation is between the target and the parameters we are changing. (The target is moving!)

Solution: maintain a separate target network and update it only periodically, decoupling the target from the parameters being learned.
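
A hedged PyTorch sketch of this fix (the network classes, optimizer, and batch tensors are assumed to exist; the DQN paper uses error clipping, but a plain MSE loss is kept here for brevity): the TD target is computed from a separate target network under `torch.no_grad()`, and the target weights are copied from the online network only every C learning steps.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One learning step with a fixed Q target (batch tensors assumed prepared)."""
    states, actions, rewards, next_states, dones = batch

    # Current estimates Q(s, a; w) from the online network
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target from the *fixed* target network: no gradient flows through it
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(online_net, target_net):
    """Called every C learning steps: copy the online weights w into the target weights w-."""
    target_net.load_state_dict(online_net.state_dict())
```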

6. Off-Policy Method

SAMPLE: collect transition tuples and store them in the buffer.
LEARN: draw random tuples from the buffer to update the Q function.

Because these two steps do not directly depend on each other, the method is off-policy.

In the LEARN step, we select a small batch of tuples from this memory at random and learn from that batch using a gradient-descent update step. These two processes are not directly dependent on each other. So, you could perform multiple sampling steps and then one learning step, or even multiple learning steps with different random batches.
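
A rough sketch of this decoupling (the `agent`, `env`, and `buffer` interfaces are assumed): one loop interleaves many SAMPLE steps with occasional LEARN steps on random batches that were mostly generated by older policies.

```python
def train(agent, env, buffer, num_steps, batch_size=32, learn_every=4, epsilon=0.1):
    """Illustrative interleaving of the SAMPLE and LEARN processes (interfaces assumed)."""
    state = env.reset()
    for step in range(num_steps):
        # SAMPLE: interact with the environment and store the transition
        action = agent.act(state, epsilon)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # LEARN: every few sampling steps, draw a random batch and update Q.
        # The batch was mostly generated by older policies, hence off-policy.
        if step % learn_every == 0 and len(buffer) >= batch_size:
            agent.learn(buffer.sample(batch_size))
```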

7. Reward Clipping

8. Error Clipping


III. Advanced DQN

3. Dueling DQN

  • Change: Dueling DQN only adds one extra layer of structure on top of the original DQN: a V stream and an A stream, where V is the mean of the Q-values and A = Q − V.
  • Benefit: better generalization and more efficient use of samples.
  • Example: suppose that after an update the first two Q-values in the second column both increase by 1. We then only need to increase V by 1 and leave A unchanged; as a result the third Q-value in that column increases by 1 as well, so two samples have implicitly been used to update three values.
  • Details:
    • A constraint must be added: in A, the values over all actions for each state must sum to zero. This forces the network to adjust Q by updating V as much as possible; otherwise it could adjust Q by updating only A and never V. For example, when all Q-values in a column need to increase by 1, the corresponding values in A cannot all increase by 1 (their sum would no longer be zero), so the network tends to adjust V in order to adjust Q.
    • How to add the constraint: normalize each column of A, i.e., subtract the column mean from each value, so that the values for each state sum to zero (see the sketch below).
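
A minimal PyTorch sketch of such a dueling head (layer sizes and names are illustrative, not taken from the paper): the advantage stream is centred by subtracting its per-state mean over actions, so the advantages sum to zero and shared shifts in Q are expressed through V.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_features, num_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(              # V(s): one scalar per state
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(          # A(s, a): one value per action
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, x):
        v = self.value(x)                        # shape (batch, 1)
        a = self.advantage(x)                    # shape (batch, num_actions)
        # Subtract the per-state mean of A so the advantages sum to zero,
        # forcing shared shifts in Q to be expressed through V.
        return v + a - a.mean(dim=1, keepdim=True)
```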

4. Prioritized Replay

For transition tuples in the buffer with a larger TD error there is more to be learned, so we can give these tuples a higher probability of being sampled for updates.
Method: when creating batches, use the TD error to compute a sampling probability and store it along with each corresponding tuple in the replay buffer.

Three things should be noted when using Prioritized Replay:

  1. Do not ignore tuples whose TD error is zero or very small.
  2. Do not over-use the tuples with high priority (overfitting).
  3. Correct the distribution bias caused by prioritized sampling.

1. Do not ignore tuples whose TD error is zero or very small.
A TD error of zero or near zero for some state does not mean the estimate has converged; it may simply be that our estimated target has not been updated well because of the limited number of samples seen so far. We should not ignore these tuples, so a small constant e can be added to keep their sampling probability non-zero.
Note that if the TD error is zero, then the priority value of the tuple, and hence its probability of being picked, will also be zero. A zero or very low TD error doesn't necessarily mean we have nothing more to learn from such a tuple; it might be that our estimate was close only because of the limited samples we had visited up to that point.

Solution: to prevent such tuples from being starved for selection, we can add a small constant ‘e’ to every priority value.

2. Do not over-use the tuples with high priority (overfitting).
Another issue along similar lines is that greedily using these priority values may lead to a small subset of experiences being replayed over and over, resulting in overfitting to that subset.

Solution: reintroduce some element of uniform random sampling. This adds another hyperparameter ‘a’, which we use to redefine the sampling probability. We can control how much we want to use priorities versus randomness by varying this parameter (a = 0 corresponds to pure uniform randomness, and a = 1 uses only the priorities).

3. Correct the distribution bias caused by prioritized sampling.

When we use prioritized experience replay, we have to make one adjustment to our update rule. Remember that our original Q-learning update is derived from an expectation over all experiences. When using a stochastic update rule, the way we sample these experiences must match the underlying distribution they came from. This is preserved when we sample experience tuples uniformly from the replay buffer, but the assumption is violated when we use non-uniform sampling, for example using priorities. The Q-values we learn will be biased according to these priority values, which we only wanted to use for sampling.

Solution: to correct for this bias, we need to introduce an importance-sampling weight equal to 1/N, where N is the size of the replay buffer, times 1/P(i), the inverse of the sampling probability. We can add another hyperparameter ‘b’ and raise each importance-sampling weight to the power b to control how much these weights affect learning. In fact, these weights matter more toward the end of learning, when the Q-values begin to converge, so you can increase b from a low value to 1 over time. Again, these details may be hard to understand at first, but each small improvement can contribute a lot toward more stable and efficient learning. So make sure you give the prioritized experience replay paper a good read.
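
Putting the three points together, here is a hedged NumPy sketch of the sampling math (the constants e, a, b and the function name are illustrative): priorities are |TD error| + e, sampling probabilities use the exponent a, and each sampled index gets an importance-sampling weight (1/N · 1/P(i))^b.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, e=0.01, a=0.6, b=0.4, rng=np.random):
    """Return indices and importance-sampling weights for one batch."""
    n = len(td_errors)
    priorities = np.abs(td_errors) + e          # point 1: no tuple gets zero priority
    probs = priorities ** a                     # point 2: a in [0, 1] blends in uniform sampling
    probs /= probs.sum()

    indices = rng.choice(n, size=batch_size, p=probs)

    # point 3: importance-sampling weights correct the non-uniform sampling bias;
    # b is annealed from a small value toward 1 as training converges.
    weights = (1.0 / (n * probs[indices])) ** b
    weights /= weights.max()                    # common normalisation for stability
    return indices, weights
```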

7. Rainbow

Paper link
Because all of the improvements above can coexist, combining every one of them is what is called the Rainbow algorithm. Below is a comparison chart of the algorithms (DDQN is Double DQN; multi-step is included in A3C):

It turns out that removing double Q-learning from Rainbow has little effect. The paper's explanation is that distributional Q-learning also alleviates DQN's over-estimation, and that is the only role double Q-learning plays.

Reposted from: https://www.cnblogs.com/bourne_lin/p/Udacity-DRL-DQN.html
