Experience Replay (a technique for training DQN)
Advantages: offline experience data can be reused; consecutive experiences are highly correlated, and sampling transitions at random from the offline replay buffer reduces this correlation.
Hyperparameter: the capacity (maximum length) of the Replay Buffer.
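A minimal sketch of such a buffer is shown below. The class and parameter names (`ReplayBuffer`, `capacity`, `push`, `sample`) are illustrative, not from the original post; the only point being made is that the capacity is the tunable hyperparameter and that sampling is uniform at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s_next)."""

    def __init__(self, capacity):
        # capacity is the hyperparameter: once the buffer is full,
        # the oldest transitions are silently discarded.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```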
- Find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T}\sum_{t=1}^{T}\frac{\delta_t^2}{2}$.
- Stochastic gradient descent (SGD):
  - Randomly sample a transition $(s_i, a_i, r_i, s_{i+1})$ from the buffer.
  - Compute the TD error, $\delta_i = Q(s_i, a_i; \mathbf{w}) - \big(r_i + \gamma\,\max_a Q(s_{i+1}, a; \mathbf{w})\big)$.
  - Stochastic gradient: $g_i = \frac{\partial\,\delta_i^2/2}{\partial \mathbf{w}} = \delta_i \cdot \frac{\partial Q(s_i, a_i; \mathbf{w})}{\partial \mathbf{w}}$.
  - SGD update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot g_i$.
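A PyTorch-style sketch of the single-transition SGD step above follows. It assumes a Q-network `q_net` that maps a single state tensor to a vector of action values; `q_net`, `optimizer`, and `gamma` are illustrative names, and the learning rate $\alpha$ lives inside the optimizer.

```python
import torch

def sgd_step(q_net, optimizer, transition, gamma):
    """One SGD update on the TD error of a single sampled transition."""
    s, a, r, s_next = transition

    # Q(s_i, a_i; w) for the action that was actually taken.
    q_sa = q_net(s)[a]

    # Bootstrapped TD target y_i = r_i + gamma * max_a Q(s_{i+1}, a; w);
    # no gradient flows through the target.
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max()

    # TD error delta_i and the loss delta_i^2 / 2.
    delta = q_sa - y
    loss = 0.5 * delta ** 2

    # w <- w - alpha * delta * dQ/dw, with alpha the optimizer's learning rate.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return delta.item()
```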

This post has introduced experience replay for training a DQN: offline data is reused, and random sampling from the replay buffer reduces the correlation between consecutive experiences; learning proceeds by minimizing the TD error with stochastic gradient descent (SGD), including minibatch sampling and a Replay Buffer implementation.
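For completeness, a sketch of the minibatch variant mentioned above is given here, tying the buffer sketch to the SGD step; `batch_size` and the tensor shapes are assumptions, and the per-transition losses are simply averaged over the sampled batch.

```python
import torch

def minibatch_step(q_net, optimizer, buffer, batch_size, gamma):
    """Average the per-transition TD losses over a random minibatch."""
    batch = buffer.sample(batch_size)
    s, a, r, s_next = zip(*batch)
    s = torch.stack(s)                              # (B, state_dim)
    a = torch.tensor(a)                             # (B,)
    r = torch.tensor(r, dtype=torch.float32)        # (B,)
    s_next = torch.stack(s_next)                    # (B, state_dim)

    # Q(s_i, a_i; w) for the taken actions in the batch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # TD targets, treated as constants.
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values

    loss = 0.5 * ((q_sa - y) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```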





