An explanation of the label bias problem


John Lafferty's CRF paper mentions the label bias problem, but the explanation there is not very clear and I did not fully understand it at the time. Searching the web, I found a mailing-list thread that discusses the problem and explains it much more clearly.
For more of the discussion, see http://wing.comp.nus.edu.sg/pipermail/graphreading/2005-September/000031.html

To understand the label bias problem, first read John Lafferty's landmark paper:
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

------------------------------------
The label bias problem arises because certain state transitions ignore the observation, due to the way the state transition diagram is constructed. Specifically, in the example given, transitions 1->2, 2->3, 4->5 and 5->3 MUST have probability 1 regardless of the observation, i.e. Pr(2|1,"i") = Pr(2|1,"o") = 1. You may wonder how on earth Pr(2|1,"o") could be equal to 1. Simple enough: state 1 has no choice, since it has only one valid out-transition.

This causes a disparity between training and inference. Suppose we have seen "rib" 3 times and "rob" once. MEMM training is supervised, so the sequences come labeled: "rib" is labeled 0123 and "rob" is labeled 0453. In training, per-label normalization (which lets you decompose the joint probability of the sequence and train the local models separately) gives Pr(1|0,"r") = 0.75 and Pr(4|0,"r") = 0.25, because in the training examples you see 0->1 three times more often than 0->4 when presented with "r". I do not even bother to "train" the other parameters, since by the construction of the state diagram they must all equal 1, given any observation.
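Here is a minimal sketch of that training step, under the assumption that the toy state diagram uses states 0..5 as in the example; per-label normalization is just counting followed by dividing within each (previous state, observation) pair:

```python
from collections import Counter, defaultdict

# Toy reconstruction of the training data from the example above:
# 3 copies of "rib" labeled 0-1-2-3 and 1 copy of "rob" labeled 0-4-5-3.
training = [("rib", [0, 1, 2, 3])] * 3 + [("rob", [0, 4, 5, 3])]

# Count transitions (previous state, observation) -> next state.
counts = defaultdict(Counter)
for word, states in training:
    for obs, prev, nxt in zip(word, states, states[1:]):
        counts[(prev, obs)][nxt] += 1

# Per-label normalization: each conditional distribution
# Pr(next | prev, obs) sums to 1 on its own.
probs = {
    key: {nxt: n / sum(ctr.values()) for nxt, n in ctr.items()}
    for key, ctr in counts.items()
}

print(probs[(0, "r")])  # {1: 0.75, 4: 0.25}
print(probs[(1, "i")])  # {2: 1.0} -- the only out-transition, so it must be 1
```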

Now during inference, suppose we are given a sequence "rob" to label. We want to find out the label sequence:
(x1, x2) = argmax_(x1, x2) Pr(0 -> x1 -> x2 -> 3 | "rob")
         = argmax_(x1, x2) Pr(x1 | 0, "r") Pr(x2 | x1, "o") Pr(3 | x2, "b")
And not surprisingly, you will output label sequence 0123 instead of the correct 0453. This is because, as I pointed out at the very beginning, Pr(2|1,"o") = 1, which is crazy but true, and it also happens that Pr(1|0,"r") > Pr(4|0,"r"). In other words, as long as one sequence is even slightly more frequent than the other in the training data, inference will always select that one.
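A minimal sketch of that inference step, reusing the probabilities learned in the training sketch above; the explicit transition table is an assumption about the exact shape of the state diagram, and forced transitions (states with a single out-transition) get probability 1 regardless of the observation:

```python
# State diagram from the example: 0 branches to 1 or 4, everything else is forced.
transitions = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

# Learned per-label-normalized probabilities from the training sketch above.
learned = {(0, "r"): {1: 0.75, 4: 0.25}}

def prob(prev, obs, nxt):
    # A state with a single out-transition ignores the observation entirely:
    # its conditional probability is forced to 1.
    if len(transitions[prev]) == 1:
        return 1.0 if nxt == transitions[prev][0] else 0.0
    return learned.get((prev, obs), {}).get(nxt, 0.0)

def path_prob(word, states):
    p = 1.0
    for obs, prev, nxt in zip(word, states, states[1:]):
        p *= prob(prev, obs, nxt)
    return p

for path in ([0, 1, 2, 3], [0, 4, 5, 3]):
    print(path, path_prob("rob", path))
# [0, 1, 2, 3] 0.75  <- wins, although this path "spells" r-i-b
# [0, 4, 5, 3] 0.25  <- the correct labeling for "rob" loses
```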

This can be solved by, say, Pr(2|1,"i") = 1 and Pr(2|1,"o") = 0, with similar designs for Pr(5|4,"o") and Pr(5|4,"i"). That is, we let certain transitions, in this case Pr(2|1,"i"), vote more strongly than other competing transitions such as Pr(2|1,"o") (note that these two transitions are uncorrelated). This certainly violates the normalization requirement, because Pr(x|1,"o") no longer sums to 1 as the axioms of probability require, so we need to explicitly allow this so that the training procedure will cooperate. We do this by introducing "weights" for each transition, i.e. an extra "strength" of vote, and adding a normalization constant in the denominator to satisfy the normalization requirement for the whole sequence.
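To make the idea concrete, here is a minimal sketch of such globally normalized scoring. The potential values below are made-up illustrative numbers, not trained weights; only the structure of the computation (unconstrained per-transition weights plus a single normalization constant over whole sequences) reflects the argument above:

```python
# Unnormalized transition potentials psi(prev, obs, next): each one may "vote"
# with its own strength, with no per-state sum-to-one constraint.
# The numeric values are illustrative assumptions, not trained weights.
potential = {
    (0, "r", 1): 3.0, (0, "r", 4): 1.0,
    (1, "i", 2): 1.0, (1, "o", 2): 0.01,  # this transition now reacts to the observation
    (4, "o", 5): 1.0, (4, "i", 5): 0.01,
    (2, "b", 3): 1.0, (5, "b", 3): 1.0,
}

def path_score(word, states):
    s = 1.0
    for obs, prev, nxt in zip(word, states, states[1:]):
        s *= potential.get((prev, obs, nxt), 1e-6)
    return s

def label(word):
    paths = ((0, 1, 2, 3), (0, 4, 5, 3))
    scores = {p: path_score(word, p) for p in paths}
    Z = sum(scores.values())  # normalization constant over whole sequences
    return {p: s / Z for p, s in scores.items()}

print(label("rob"))  # (0, 4, 5, 3) now gets almost all of the probability mass
print(label("rib"))  # and (0, 1, 2, 3) wins for "rib"
```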

Alternatively, as I mentioned in my previous email, you can design another objective function instead of the log-likelihood of the data. This is just another clever way to compensate for breaking away from the axioms of probability. I would rather avoid this approach because of its ad-hocness: I prefer to tweak the application-level representation of probability while still sticking to the underlying probabilistic formalism, which lets me reuse well-developed concepts such as the maximum likelihood principle. But this is purely a personal choice!