A Theoretical Analysis of DQN (notes on "A Theoretical Analysis of Deep Q-Learning")

Article link: https://blog.csdn.net/weixin_45776027/article/details/117299674

These are reading notes.

The paper analyzes a slightly simplified version of DQN under mild assumptions.

It focuses on two components of DQN: experience replay and the target network.

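Approximating the action-value function with a neural network often leads to instability.
The target network is used to obtain an unbiased estimator of the mean-squared Bellman error used in training the Q-network.
(However, if the target network is synchronized with the Q-network at every iteration, the two become coupled; TD3 addresses this with delayed updates.)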
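
To make the synchronization point concrete, here is a minimal sketch of Q-learning with a periodically synced target. It is not the paper's algorithm: the linear Q-function, the function name `train_with_target`, and the parameter `sync_every` are all assumptions made for illustration.

```python
import numpy as np

def train_with_target(transitions, n_features, gamma, lr, sync_every):
    """Semi-gradient Q-learning with a hard target update (illustrative only).

    Assumes a linear Q-function Q(s, a) = phi(s, a) @ theta; each transition is
    (phi, reward, [phi of every next action]).
    """
    theta = np.zeros(n_features)   # online weights, updated every step
    theta_target = theta.copy()    # frozen copy, used only to build targets
    syncs = 0
    for step, (phi, r, phi_next_all) in enumerate(transitions, start=1):
        # The target is computed from the frozen weights, not the online ones.
        y = r + gamma * max(p @ theta_target for p in phi_next_all)
        theta += lr * (y - phi @ theta) * phi
        if step % sync_every == 0:
            theta_target = theta.copy()  # hard sync every `sync_every` steps
            syncs += 1
    return theta, theta_target, syncs


# Toy problem: one state, one action, reward 1, gamma 0.5, so Q* = 2.
one = np.array([1.0])
data = [(one, 1.0, [one])] * 100
theta, theta_target, syncs = train_with_target(
    data, n_features=1, gamma=0.5, lr=1.0, sync_every=10
)
```

With `sync_every=1` the target moves after every step, which is the coupled case the note warns about. A larger period keeps the regression target fixed between syncs.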

These two points are used to explain why experience replay and the target network are needed.

Additional notes:

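Experience replay needs little introduction: transitions (five-tuples) are stored in the replay buffer in temporal order, then drawn by uniform random sampling or prioritized sampling and fed to the network to fit the action-value function Q, which gives better stability.
(The intuition behind experience replay is to achieve stability by breaking the temporal dependency among the observations used in training the deep neural network.)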
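
A minimal sketch of the uniform-sampling variant follows. The class name `ReplayBuffer` and the five-tuple layout `(state, action, reward, next_state, done)` are assumptions for illustration; prioritized sampling is not shown.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        # Transitions arrive in temporal order.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    buf.push(t, t % 2, 1.0, t + 1, False)
batch = buf.sample(32)
```

After 250 pushes into a buffer of capacity 100, only the most recent 100 transitions remain, and each batch mixes time steps that were far apart in the original trajectory.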
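The training objective can be decomposed into two parts: the mean-squared Bellman error (MSBE) and a variance term. Without a target network, both parts depend on the parameters being optimized, so minimizing the objective is not equivalent to minimizing the MSBE alone. With a target network, the second term no longer depends on the parameters being optimized, so the problem reduces to minimizing the MSBE. To some extent, this explains why the target network is needed.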
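
The decomposition can be written out as below. This is a sketch in assumed notation, not copied from the paper: θ is the Q-network parameter, θ⁻ the target-network parameter, T the Bellman optimality operator, and the expectation is over the sampled (S, A) and the resulting transition.

```latex
Y = R + \gamma \max_{a'} Q_{\theta^-}(S', a'),
\qquad
\mathbb{E}[\,Y \mid S, A\,] = (T Q_{\theta^-})(S, A)

\mathbb{E}\big[(Y - Q_\theta(S, A))^2\big]
  = \underbrace{\mathbb{E}\big[\big(Q_\theta(S, A) - (T Q_{\theta^-})(S, A)\big)^2\big]}_{\text{MSBE}}
  + \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid S, A)\big]}_{\text{variance}}
```

The cross term vanishes because Y minus its conditional mean has zero mean given (S, A). When θ⁻ is held fixed, the variance term is a constant in θ; when θ⁻ = θ, it changes with θ as well.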

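FQI: Fitted Q-iteration
FQI can be viewed as a simplified version of DQN's replay buffer and target network. The convergence result is obtained mainly by analyzing FQI with neural-network function approximation.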
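
The FQI loop can be sketched as follows. Note the substitution: the paper fits ReLU networks at each iteration, while this sketch uses one-hot (tabular) features with least squares so that it stays short and checkable. The chain MDP, the function name `fitted_q_iteration`, and all parameter values are assumptions for illustration.

```python
import numpy as np


def fitted_q_iteration(S, A, R, S2, n_states, n_actions, gamma, K):
    """FQI with one-hot features and least-squares regression (illustrative)."""
    n = len(S)
    X = np.zeros((n, n_states * n_actions))
    X[np.arange(n), S * n_actions + A] = 1.0  # one-hot encoding of (s, a)
    Q = np.zeros((n_states, n_actions))
    for _ in range(K):
        # Regression targets use the previous iterate, which plays the role
        # of the frozen target network.
        y = R + gamma * Q[S2].max(axis=1)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        Q = w.reshape(n_states, n_actions)
    return Q


# Toy chain MDP (assumed): action 1 moves right, action 0 moves left,
# reward 1 whenever the next state is the right end.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
n = 500
# The batch is drawn from a fixed distribution: uniform over (s, a).
S = rng.integers(0, n_states, size=n)
A = rng.integers(0, n_actions, size=n)
S2 = np.clip(S + np.where(A == 1, 1, -1), 0, n_states - 1)
R = (S2 == n_states - 1).astype(float)

Q = fitted_q_iteration(S, A, R, S2, n_states, n_actions, gamma, K=200)
greedy = Q.argmax(axis=1)
```

Each pass solves one supervised regression problem against fixed targets, which is why FQI serves as a stand-in for "replay buffer plus target network" in the analysis.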

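Assumption:

The replay buffer is replaced by a fixed sampling distribution, which sidesteps the exploration problem in RL. For example, starting from the initial distribution, the agent does not observe any particular state too often, whatever policy it follows.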
Theorem:
The theorem analyzes, from the perspective of error, how frequently the target network should be updated.
Proof sketch

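K is the number of iterations,

ε_max is the maximum loss over the first K iterations;
In the bound, the first term vanishes as the sample size n -> ∞, while the second term decays exponentially with the number of iterations and is therefore less important.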
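
The shape of the bound described here can be sketched as below. This is not the paper's exact statement: the norm, the concentration coefficient, and the numerical constants are collapsed into placeholder constants C_1 and C_2. The names ε_max (largest per-iteration error), R_max (reward bound), and π_K (greedy policy after K iterations) are assumed notation.

```latex
\big\| Q^* - Q^{\pi_K} \big\|
  \;\le\;
  \underbrace{C_1 \cdot \varepsilon_{\max}}_{\text{statistical error}}
  \;+\;
  \underbrace{C_2 \cdot \gamma^{K} \cdot R_{\max}}_{\text{algorithmic error}}
```

For a sense of scale, with γ = 0.9 the factor γ^K is already about 0.005 at K = 50, so after a moderate number of iterations the statistical term dominates.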

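The first expression is the bias incurred when fitting with ReLU (piecewise-linear) networks.
The second and third expressions reflect the variance of the estimate.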

(Conclusion: the statistical error characterizes the bias and variance that arise from approximating the action-value function using a neural network, while the algorithmic error geometrically decays to zero as the number of iterations goes to infinity.)