John Lafferty's CRF paper discusses the label bias problem, but the explanation there is not very clear, and I did not fully understand it at first. Searching the web, I found a mailing-list thread that discusses the problem and explains it rather well.
For more discussion, see http://wing.comp.nus.edu.sg/pipermail/graphreading/2005-September/000031.html
To understand the label bias problem, first read John Lafferty's classic paper:
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
------------------------------------
The label bias problem arises because certain state transitions ignore the observation, due to the way the state transition diagram is constructed. Specifically, in the example given, the transitions 1->2, 2->3, 4->5 and 5->3 MUST have probability 1 regardless of the observation, i.e. Pr(2|1,"i") = Pr(2|1,"o") = 1. You may wonder how on earth Pr(2|1,"o") could be equal to 1. Simple enough: the model has no choice, since state 1 has only one valid outgoing transition. This causes a disparity between training and inference. Suppose we have seen "rib" 3 times and "rob" once. Note that the MEMM is trained with supervision, so the sequences come labeled: "rib" as 0123 and "rob" as 0453. Now in training, by per-label normalization (which lets you decompose the joint probability of the sequence and train the local models separately), you will find that Pr(1|0,"r") = 0.75 and Pr(4|0,"r") = 0.25, because in the training examples you see 0->1 three times as often as 0->4 when presented with "r". I do not even bother to "train" the other parameters, since by the construction of the state diagram they must all equal 1, given any observation.
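The per-label-normalized training step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thread; the `successors` table encodes the paper's rib/rob state diagram, and the uniform fallback for unseen (state, observation) pairs is my own assumption to make the forced transitions explicit.

```python
from collections import Counter

# State diagram from the paper's example: "rib" is labeled 0-1-2-3 and
# "rob" is labeled 0-4-5-3; only state 0 has more than one successor.
successors = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

# Training set from the text: three "rib", one "rob".
training = [("rib", [0, 1, 2, 3])] * 3 + [("rob", [0, 4, 5, 3])]

# Count (previous state, observation, next state) triples.
counts = Counter()
for word, labels in training:
    for ch, prev, nxt in zip(word, labels, labels[1:]):
        counts[(prev, ch, nxt)] += 1

def pr(nxt, prev, obs):
    """MEMM-style local probability, normalized per (state, observation)."""
    scores = {s: counts[(prev, obs, s)] for s in successors[prev]}
    total = sum(scores.values())
    if total == 0:
        # Unseen (state, observation) pair: fall back to uniform over the
        # allowed successors. With a single successor this is 1 no matter
        # what the observation is -- the crux of the label bias problem.
        return 1.0 / len(successors[prev])
    return scores[nxt] / total

print(pr(1, 0, "r"))   # 0.75
print(pr(4, 0, "r"))   # 0.25
print(pr(2, 1, "o"))   # 1.0, even though "o" never followed state 1 in training
```

Note that Pr(2|1,"o") comes out as 1 purely because state 1 has a single successor, exactly as described above.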
Now during inference, suppose we are given a sequence "rob" to label. We want to find out the label sequence:
(x1, x2) = argmax_{x1,x2} Pr(0->x1->x2->3 | "rob")
         = argmax_{x1,x2} Pr(x1|0,"r") Pr(x2|x1,"o") Pr(3|x2,"b")
And not surprisingly, you will output the label sequence 0123 instead of the correct 0453. This is because, as I pointed out at the very beginning, Pr(2|1,"o") = 1, which is crazy but true, and it also happens that Pr(1|0,"r") > Pr(4|0,"r"). This means that as long as one sequence is even slightly more frequent than the other in the training data, inference will always select that one.
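To see the failure end to end, the sketch below scores both candidate label sequences for "rob" under the locally normalized model. Again this is my own illustration of the thread's argument (the data, state diagram, and uniform fallback are assumptions, as before), not code from the original discussion.

```python
from collections import Counter

# Same setup as before: rib/rob state diagram and 3-to-1 training data.
successors = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}
training = [("rib", [0, 1, 2, 3])] * 3 + [("rob", [0, 4, 5, 3])]

counts = Counter()
for word, labels in training:
    for ch, prev, nxt in zip(word, labels, labels[1:]):
        counts[(prev, ch, nxt)] += 1

def pr(nxt, prev, obs):
    scores = {s: counts[(prev, obs, s)] for s in successors[prev]}
    total = sum(scores.values())
    if total == 0:
        return 1.0 / len(successors[prev])  # forced transition if one successor
    return scores[nxt] / total

def path_prob(word, labels):
    """Product of per-step local probabilities along one label sequence."""
    p = 1.0
    for ch, prev, nxt in zip(word, labels, labels[1:]):
        p *= pr(nxt, prev, ch)
    return p

# The forced transitions contribute factor 1 regardless of "o" vs "i",
# so the comparison reduces entirely to Pr(x1|0,"r"):
print(path_prob("rob", [0, 1, 2, 3]))  # 0.75 -- the wrong labeling wins
print(path_prob("rob", [0, 4, 5, 3]))  # 0.25 -- the correct labeling loses
```

The middle observation "o" is simply never consulted: both interior transitions multiply in a factor of 1, so the decision was already made at the first step.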
This can be solved by, say, setting Pr(2|1,"i") = 1 and Pr(2|1,"o") = 0, with similar designs for Pr(5|4,"o") and Pr(5|4,"i"). This means we let certain transitions, in this case the one behind Pr(2|1,"i"), vote more strongly than competing transitions such as the one behind Pr(2|1,"o") (note that these two transitions are uncorrelated). This certainly violates the local normalization requirement, because Pr(x|1,"o") no longer sums to 1 as the axioms of probability require, so we need to explicitly allow this and make the training procedure cooperate. We do this by introducing a "weight" for each transition, i.e. an extra "strength" of vote, and adding a normalization constant in the denominator to satisfy the normalization requirement over the whole sequence.
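A CRF-style globally normalized model of this kind can be sketched as follows. The weights here are hand-picked for illustration, not trained, and the brute-force enumeration of paths stands in for the usual forward-backward computation of the normalizer; everything else (state diagram, word list) follows the example above.

```python
import math

successors = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

# Hypothetical per-transition weights (log scale): transitions whose branch
# matches the observation vote strongly; unlisted triples get weight 0.
# These numbers are made up for illustration, not learned from data.
w = {(0, "r", 1): 1.0, (0, "r", 4): 1.0,
     (1, "i", 2): 2.0, (4, "o", 5): 2.0,
     (2, "b", 3): 1.0, (5, "b", 3): 1.0}

def paths(word, state=0):
    """Enumerate every legal label sequence through the state diagram."""
    if not word:
        return [[state]]
    return [[state] + rest
            for nxt in successors.get(state, [])
            for rest in paths(word[1:], nxt)]

def score(word, labels):
    """Unnormalized score of one whole path: exp of the summed weights."""
    return math.exp(sum(w.get((p, ch, n), 0.0)
                        for ch, p, n in zip(word, labels, labels[1:])))

def decode(word):
    all_paths = paths(word)
    z = sum(score(word, p) for p in all_paths)  # single global normalizer
    return max(all_paths, key=lambda p: score(word, p) / z)

print(decode("rob"))  # [0, 4, 5, 3] -- the interior observation now matters
print(decode("rib"))  # [0, 1, 2, 3]
```

Because normalization happens once over the whole sequence rather than per state, the strong vote of the 4->5 transition on "o" can outweigh the initial 0->1 preference, which is exactly the behavior the per-label-normalized model could not express.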
Alternatively, as I mentioned in my previous email, you can design another objective function instead of the log likelihood of the data. This is just another clever way to compensate for breaking away from the axioms of probability. It is an approach I would like to avoid due to its ad-hocness: I'd prefer to tweak the application-level representation of probability while sticking to the underlying probabilistic formalism, which lets me make use of well-developed concepts such as the maximum likelihood principle. But this is purely a personal choice!