label bias problem

Rereading the original CRF paper today, I still could not quite understand the label bias problem, so I found a reposted explanation and translate it below.

The label bias problem arises because certain state transitions ignore observations due to the way the state transition diagram is constructed. Specifically, in the example given, transitions 1->2, 2->3, 4->5 and 5->3 MUST have probability 1, regardless of the observation, i.e. Pr(2|1,"i")=Pr(2|1,"o")=1. You may wonder how on earth Pr(2|1,"o") could be equal to 1. Simple enough, because it has no choice since it got just one valid out-transition. This causes disparity between training and inference:

[Translation] The label bias problem arises because the way the state transition diagram is constructed lets certain state transitions ignore the observations. Here is an example; the transition chain is:

 1->2, 2->3, 4->5 and 5->3

(If these are the only out-transitions, then) each of these transitions must have probability 1 regardless of the observation, e.g. Pr(2|1,"i")=Pr(2|1,"o")=1. You may wonder why on earth this should hold; the reason is simple: there is only one valid out-transition, so there is no choice. This causes a disparity between training and inference.
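The forced-transition behavior described above can be sketched in a few lines of Python (the state diagram is the one from the text; the function name `local_prob` is illustrative, not from any library):

```python
# Minimal sketch of per-state ("local") normalization in an MEMM-style model.
# A state with a single out-transition must assign it probability 1,
# whatever the observation says.
transitions = {1: [2], 2: [3], 4: [5], 5: [3]}

def local_prob(prev, nxt, observation):
    """P(nxt | prev, observation) under per-state normalization."""
    outs = transitions[prev]
    if nxt not in outs:
        return 0.0
    # Normalizing any positive score over a single successor gives 1,
    # so the observation cannot influence the result.
    if len(outs) == 1:
        return 1.0
    raise NotImplementedError  # branching states would need trained weights

print(local_prob(1, 2, "i"))  # 1.0
print(local_prob(1, 2, "o"))  # 1.0 -- the observation is ignored
```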

suppose we have seen 3 rib and 1 rob. Note that MEMM is a supervised training method, which means you will have them labeled when given to you: rib will be labeled as 0123 and rob will be labeled as 0453. Now in training, by per-label normalization, which enables you to decompose the joint probability of the sequence and train local models separately, you will find that

Pr(1|0,"r")=0.75 and Pr(4|0,"r")=0.25

because in the training examples you see 0->1 three times more often than 0->4 when presented with "r". Now I do not even bother to "train" the other parameters since they all must be equal to 1, due to the construction of the state diagram, given any observation. Now during inference, suppose we are given a sequence "rob" to label. We want to find out the label sequence:
(x1,x2) = argmax_(x1,x2) Pr(0->x1->x2->3|"rob")
        = argmax_(x1,x2) Pr(x1|0,"r") Pr(x2|x1,"o") Pr(3|x2,"b")

[Translation] (Let's take a more detailed example!) Suppose we have seen the word rib 3 times and the word rob once. Note that MEMM is a supervised training method: when the words are given to us, they come with labels. Say the manual annotation labels rib as 0123 and rob as 0453. Now we move on to training. Per-label normalization lets us decompose the joint probability of the sequence and train the local models separately, so we find:

Pr(1|0,"r")=0.75    and Pr(4|0,"r")=0.25

because in the training samples, when the letter "r" appears at the start, 0->1 occurs three times as often as 0->4. We do not bother training the other model parameters, since all the other transition probabilities must be 1. Now we move on to inference. Suppose we are given the sequence "rob" and want to find its label sequence:

(x1,x2) = argmax_(x1,x2) Pr(0->x1->x2->3|"rob")
        = argmax_(x1,x2) Pr(x1|0,"r") Pr(x2|x1,"o") Pr(3|x2,"b")

And not surprisingly, you will output label sequence 0123 instead of the correct 0453. This is due to, as I pointed out at the very beginning, Pr(2|1,"o")=1, which is crazy but true, and it also happens that Pr(1|0,"r") > Pr(4|0,"r"). This means that as long as one sequence is just slightly more frequent than the other in the training data, during inference we will always select that one. This can be solved by, say: Pr(2|1,"i")=1, Pr(2|1,"o")=0, and similar designs for Pr(5|4,"o") and Pr(5|4,"i"). This means we let certain transitions, in this case Pr(2|1,"i"), vote more strongly than other competing transitions, e.g. Pr(2|1,"o") (note that these two transitions are uncorrelated). This certainly violates the normalization requirement because Pr(x|1,"o") does not sum to 1 now (required by the axioms of probability), thus we need to explicitly allow this so that the training procedure will cooperate.

[Translation] Not surprisingly, we will output the sequence 0123 instead of 0453. This is because, as pointed out above, Pr(2|1,"o")=1 and it also happens that Pr(1|0,"r") > Pr(4|0,"r"). It means that as long as one sequence appears slightly more often than the others in the training data, inference will always select that one. One way to solve this is to set, say:

Pr(2|1,"i")=1, Pr(2|1,"o")=0

and likewise design

Pr(5|4,"o") and Pr(5|4,"i")

This means we let a particular transition probability, Pr(2|1,"i"), vote more strongly than competing ones such as Pr(2|1,"o").

We do this by introducing "weights" for each transition, i.e. an extra "strength" of vote, and add a normalization constant in the denominator to fulfill the normalization requirement over the whole sequence. Alternatively, as I mentioned in my previous email, you can design another objective function instead of the log likelihood of the data. This is just another clever way to compensate for breaking away from the axioms of probability. It is an approach I would like to avoid due to its ad-hocness. I'd prefer to tweak the application-level representation of probability but still stick to the underlying formalism of probability, which would enable me to make use of well-developed concepts such as the maximum likelihood principle. But this is purely a personal choice!
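The weighted-vote idea above, with normalization over whole sequences rather than per state, is essentially what a globally normalized model such as a CRF does. A minimal sketch, with the weights hand-set to the values suggested in the text rather than trained:

```python
# Each (state, observation, next-state) transition gets an unnormalized
# weight ("strength of vote"); whole label paths are scored and then
# normalized jointly, so one forced local transition can no longer
# dominate. Weights are illustrative, not learned.
weight = {
    (0, "r", 1): 3.0, (0, "r", 4): 1.0,
    (1, "i", 2): 1.0, (1, "o", 2): 0.0,   # the zero vote suggested above
    (4, "o", 5): 1.0, (4, "i", 5): 0.0,
    (2, "b", 3): 1.0, (5, "b", 3): 1.0,
}

def seq_score(word, labels):
    """Unnormalized score of one label path for a word."""
    s = 1.0
    for ch, prev, nxt in zip(word, labels, labels[1:]):
        s *= weight.get((prev, ch, nxt), 0.0)
    return s

paths = {"0123": [0, 1, 2, 3], "0453": [0, 4, 5, 3]}
raw = {name: seq_score("rob", p) for name, p in paths.items()}
z = sum(raw.values())  # global normalization constant
for name, s in raw.items():
    print(name, s / z)  # 0123 -> 0.0, 0453 -> 1.0
```

Because the (1,"o")->2 transition now carries zero weight, the 0123 path for "rob" scores 0 before normalization, and the correct 0453 path wins.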
