————"Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications"
[114] Tsitsiklis, J. N., and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (pp. 1075-1081).
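In DRL, a common mechanism is to maintain a soft-updated target Q-network, usually named q_network_bar or q_network_target in code. Its parameters are not trained directly; instead they track the parameters of q_network with a lag, moving a small step toward them on each update. The purpose is to keep the bootstrapped TD targets from shifting too quickly, which makes training more stable. This is mainly relevant to off-policy, value-based methods (for example DQN or DDPG) rather than on-policy ones.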
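The lagged update described above is commonly written as target <- tau * online + (1 - tau) * target. The following is a minimal sketch of that rule, not a definitive implementation. It assumes the parameters are held as plain dicts of float lists (a stand-in for framework tensors), and the helper name `soft_update` and the step size `tau` are illustrative choices that do not come from the original note.

```python
def soft_update(q_network, q_network_target, tau=0.005):
    """Move target parameters a fraction tau toward the online parameters, in place.

    Assumption: both arguments are dicts mapping a parameter name to a list of
    floats with matching keys and lengths (a stand-in for real tensors).
    """
    for name, online_values in q_network.items():
        target_values = q_network_target[name]
        for i, w in enumerate(online_values):
            # target <- tau * online + (1 - tau) * target
            target_values[i] = tau * w + (1.0 - tau) * target_values[i]
    return q_network_target


# Toy parameters: the target starts at zero and lags behind the online network.
q_network = {"w": [1.0, 2.0], "b": [0.5]}
q_network_target = {"w": [0.0, 0.0], "b": [0.0]}

soft_update(q_network, q_network_target, tau=0.1)
first_step = {k: list(v) for k, v in q_network_target.items()}

# Repeating the update lets the target drift toward the online parameters.
for _ in range(200):
    soft_update(q_network, q_network_target, tau=0.1)
```

With a small `tau` the target changes slowly, so the TD targets computed from it stay nearly fixed between updates. Setting `tau=1.0` reduces the rule to a hard copy of q_network's parameters.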