Article link: https://blog.csdn.net/Fox_Alex/article/details/109018726

Delusional Bias (1)

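This post records my own understanding of delusional bias. The concept was first introduced in the NIPS 2018 paper "Non-delusional Q-learning and value-iteration", which points out an inherent problem that appears when value-iteration methods (Q-learning or other forms of DP) are combined with function approximators: delusional bias. The concept is quite convoluted and the paper is not easy to read, so I will cover it in detail over several posts.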

1. What is Delusional Bias?

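Delusional bias appears when Q-learning (or another form of DP) uses function approximation: the updates are based on values that are mutually inconsistent. In the paper's words: "approximate Q-learning suffers from delusional bias, in which updates are based on mutually inconsistent values."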

2. When?

It arises when the approximation architecture restricts which greedy policies can be represented. In the paper's words: "Delusional bias arises when the approximation architecture limits the class of expressible greedy policies."
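To make "limits the class of expressible greedy policies" concrete, here is a minimal sketch. The two states, two actions, and scalar features are hypothetical values I chose for illustration; they are not the example from the paper. With a linear approximator Q(s,a) = theta * phi(s,a) and a single parameter, scanning theta over a grid shows that only two of the four deterministic policies can ever be greedy.

```python
import numpy as np

# Hypothetical features for a linear approximator Q(s, a) = theta . phi(s, a).
# These numbers are illustrative assumptions, not taken from the paper.
PHI = {
    ("s1", "a"): np.array([1.0]),
    ("s1", "b"): np.array([0.0]),
    ("s2", "a"): np.array([2.0]),
    ("s2", "b"): np.array([0.0]),
}
STATES = ("s1", "s2")
ACTIONS = ("a", "b")


def greedy_policy(theta):
    """Return the greedy action for each state, or None if any state has a tie."""
    policy = []
    for s in STATES:
        q = [float(theta @ PHI[(s, a)]) for a in ACTIONS]
        if np.isclose(q[0], q[1]):
            return None  # ignore parameter values where the argmax is ambiguous
        policy.append(ACTIONS[int(np.argmax(q))])
    return tuple(policy)


def expressible_policies(thetas):
    """Collect every greedy policy produced by some theta on the grid."""
    found = set()
    for t in thetas:
        policy = greedy_policy(np.array([t]))
        if policy is not None:
            found.add(policy)
    return found


expressible = expressible_policies(np.linspace(-5.0, 5.0, 101))
print(sorted(expressible))
```

With these features, any theta > 0 picks a in both states and any theta < 0 picks b in both states, so the mixed policies (a at s1, b at s2) and (b at s1, a at s2) are not expressible.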

3. Why?

The Q-update for a state-action pair (s,a) backs up the maximum value estimate over all actions at the next state, without checking whether those action choices can be made together by any single admissible policy. In the paper's words: "This inconsistency arises because the Q-update for a state-action pair, (s,a), is based on the maximum value estimate over all actions at the next state, which ignores the fact that the actions so-considered (including the choice of a at s) might not be jointly realizable given the set of admissible policies derived from the approximator."
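The following sketch shows one such inconsistent backup. Everything in it is an assumption made for illustration: the scalar features, the discount factor GAMMA, the current parameter theta = -1, and the helper names. It computes the standard Q-learning target for (s1, a) and then checks, by brute force over a grid of theta values, whether "take a at s1" and the action chosen by the max at s2 can both be greedy for the same parameter.

```python
import numpy as np

# Hypothetical scalar features for Q(s, a) = theta * phi(s, a); not from the paper.
PHI = {("s1", "a"): 1.0, ("s1", "b"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 0.0}
ACTIONS = ("a", "b")
GAMMA = 0.9  # assumed discount factor


def q_value(theta, s, a):
    return theta * PHI[(s, a)]


def greedy_action(theta, s):
    return max(ACTIONS, key=lambda a: q_value(theta, s, a))


def q_learning_target(theta, reward, next_state):
    """Standard backup: reward + gamma * max over all actions at the next state."""
    return reward + GAMMA * max(q_value(theta, next_state, a) for a in ACTIONS)


def jointly_realizable(choices, thetas):
    """True if some theta on the grid makes every (state, action) choice strictly greedy."""
    for t in thetas:
        ok = all(
            all(q_value(t, s, a) > q_value(t, s, other)
                for other in ACTIONS if other != a)
            for s, a in choices
        )
        if ok:
            return True
    return False


theta = -1.0
backed_up_action = greedy_action(theta, "s2")  # action used by the max at s2
target = q_learning_target(theta, reward=0.0, next_state="s2")
grid = np.linspace(-5.0, 5.0, 101)
consistent = jointly_realizable([("s1", "a"), ("s2", backed_up_action)], grid)
print(backed_up_action, target, consistent)
```

Here the target for (s1, a) is built from the value of b at s2, yet choosing a at s1 requires theta > 0 while choosing b at s2 requires theta < 0. No single parameter supports both, which is exactly the "not jointly realizable" situation described above.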

4. Consequence

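Unconstrained updates introduce errors into the target values and are a significant source of value estimation error: Q-learning readily backs up values based on action choices that the greedy policy class cannot realize. Delusional bias is an inherent problem in how Q-updates interact with a constrained policy class, and it cannot be resolved by more expressive approximators, larger training sets, or more computation.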

5. How to eliminate Delusional Bias?

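Policy-consistent backup operator: instead of simply looking up a single future value for each state-action pair, it looks up a set of candidate values, each justified by a set of associated policy commitments. In the paper's words: "These methods complement the value-based nature of value iteration and Q-learning with explicit constraints on the policies consistent with generated values, and use the values to select policies from the admissible policy class."

In the tabular case with policy constraints (which isolates delusion error from approximation error), the algorithms are proven to converge to an optimal policy within the admissible policy class. When the greedy policy class has finite VC-dimension, the number of information sets is polynomially bounded, so in the tabular case the algorithms have polynomial-time iteration complexity.

Drawback: consistent backups can cause the number of information sets to blow up. The paper therefore suggests search heuristics that focus on promising information sets, as well as methods that impose (or approximate) policy consistency within a batch of training data, in an effort to drive the approximator toward better estimates.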
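The idea of keeping one candidate value per policy commitment can be sketched on a tiny tabular problem. This is a heavily simplified illustration under my own assumptions, not the algorithm from the paper: the deterministic MDP, the rewards, the discount GAMMA, and the two-policy admissible class are all hypothetical, and the sketch enumerates admissible policies by brute force instead of maintaining information sets.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic MDP: (state, action) -> (reward, next state or None if terminal).
MDP = {
    ("s1", "a"): (0.0, "s2"),
    ("s1", "b"): (0.3, None),
    ("s2", "a"): (0.2, None),
    ("s2", "b"): (1.0, None),
}
ACTIONS = ("a", "b")

# Assumed admissible policy class: only "same action in both states" is expressible.
ADMISSIBLE = [{"s1": "a", "s2": "a"}, {"s1": "b", "s2": "b"}]


def standard_backup(s, a):
    """Usual backup: max over all next actions, ignoring policy constraints."""
    reward, nxt = MDP[(s, a)]
    if nxt is None:
        return reward
    return reward + GAMMA * max(standard_backup(nxt, a2) for a2 in ACTIONS)


def consistent_backup(s, a):
    """Return {policy index: value}, one candidate per admissible policy taking a at s."""
    reward, nxt = MDP[(s, a)]
    candidates = {}
    for i, pi in enumerate(ADMISSIBLE):
        if pi[s] != a:
            continue  # this policy does not commit to a at s
        if nxt is None:
            candidates[i] = reward
        else:
            # Follow the same policy at the next state, so the commitment stays consistent.
            candidates[i] = reward + GAMMA * consistent_backup(nxt, pi[nxt])[i]
    return candidates


naive = {a: standard_backup("s1", a) for a in ACTIONS}
consistent = {a: max(consistent_backup("s1", a).values()) for a in ACTIONS}
naive_choice = max(naive, key=naive.get)
consistent_choice = max(consistent, key=consistent.get)
print(naive, naive_choice)
print(consistent, consistent_choice)
```

In this toy problem the standard backup values (s1, a) by assuming b at s2, which no admissible policy allows, and therefore prefers a at s1. The consistent version only credits (s1, a) with what the policy "a everywhere" can actually collect, so it prefers b at s1, which is the better choice within the admissible class.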

A mind map of delusional bias that I drew myself: