[Natural Language Processing] Tagging Schemes: IO, BIO, BMEWO, and BMEWO+

IO Encoding

The simplest encoding is the IO encoding, which tags each token as either being in (I_X) a named entity of type X or in no entity (O). This encoding is defective in that it can’t represent two entities of the same type next to each other, because there’s no boundary tag: the adjacent I_X tags simply run together into one entity.
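To make the defect concrete, here is a minimal Python sketch of IO encoding and decoding. The function names, the `(start, end, type)` span format (end exclusive), and the example sentence are my own assumptions for illustration, not anything from LingPipe.

```python
def io_encode(n_tokens, spans):
    """Tag tokens as I_X (inside an entity of type X) or O (outside).

    spans is a list of (start, end, type) triples with end exclusive
    (a hypothetical format chosen for this sketch).
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        for i in range(start, end):
            tags[i] = "I_" + etype
    return tags


def io_decode(tags):
    """Recover spans by merging each maximal run of identical I_X tags."""
    spans, start, current = [], None, None
    # The trailing "O" is a sentinel that closes a run ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        etype = tag[2:] if tag.startswith("I_") else None
        if etype != current:
            if current is not None:
                spans.append((start, i, current))
            start, current = i, etype
    return spans


tokens = ["Tell", "John", "Mary", "left"]
gold = [(1, 2, "PER"), (2, 3, "PER")]  # two adjacent one-token people
tags = io_encode(len(tokens), gold)
print(tags)
print(io_decode(tags))
```

Decoding the tags for the two adjacent people gives back a single two-token span, so the round trip loses the boundary between them.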

BIO Encoding

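The “industry standard” encoding is the BIO encoding (anyone know who invented this encoding?). It subdivides the in tags as either being begin-of-entity (B_X) or continuation-of-entity (I_X). Because every entity starts with a B_X tag, two adjacent entities of the same type stay distinguishable.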
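A rough sketch of BIO encoding and decoding in Python shows how the begin tag keeps neighboring entities apart. As before, the function names and the `(start, end, type)` span format with exclusive end are assumptions made for the example; the decoder's lenient handling of a stray I_X tag is also my choice, not part of any standard.

```python
def bio_encode(n_tokens, spans):
    """First token of each entity gets B_X, the remaining tokens get I_X."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = "B_" + etype
        for i in range(start + 1, end):
            tags[i] = "I_" + etype
    return tags


def bio_decode(tags):
    """Recover (start, end, type) spans; every B_X tag opens a new span."""
    spans, start, current = [], None, None
    # The trailing "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("_")
        continues = prefix == "I" and etype == current
        if not continues:
            if current is not None:
                spans.append((start, i, current))
            # Assumption: an I_X with no open X entity is treated like B_X.
            start, current = (i, etype) if prefix in ("B", "I") else (None, None)
    return spans


gold = [(1, 2, "PER"), (2, 3, "PER")]  # "Tell [John] [Mary] left"
tags = bio_encode(4, gold)
print(tags)
print(bio_decode(tags) == gold)
```

Here the two adjacent one-token people are tagged B_PER B_PER, and decoding recovers both spans.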

BMEWO Encoding

The BMEWO encoding further distinguishes end-of-entity (E_X) tokens from mid-entity tokens (M_X), and adds a whole new tag for single-token entities (W_X). I believe the BMEWO encoding was introduced in Andrew Borthwick’s NYU thesis and related papers on “max entropy” named entity recognition around 1998, following Satoshi Sekine’s similar encoding for decision tree named entity recognition. (Satoshi and David Nadeau just released their Survey of NER.)
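As an illustration, a BMEWO encoder can be sketched in a few lines of Python. The function name, the `(start, end, type)` span format with exclusive end, and the example sentence are assumptions of mine.

```python
def bmewo_encode(n_tokens, spans):
    """Tag tokens with B_X, M_X, E_X, W_X, or O."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = "W_" + etype          # whole entity in a single token
        else:
            tags[start] = "B_" + etype          # begin of entity
            for i in range(start + 1, end - 1):
                tags[i] = "M_" + etype          # middle of entity
            tags[end - 1] = "E_" + etype        # end of entity
    return tags


tokens = ["John", "lives", "in", "New", "York", "City", "."]
spans = [(0, 1, "PER"), (3, 6, "LOC")]
for token, tag in zip(tokens, bmewo_encode(len(tokens), spans)):
    print(token, tag)
```

In this example "John" comes out as W_PER, and "New York City" as B_LOC, M_LOC, E_LOC.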

BMEWO+ Encoding

I introduced the BMEWO+ encoding for the LingPipe HMM-based chunkers. Because of the conditional independence assumptions in HMMs, they can’t directly use information about preceding or following words. Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information. For example, this allows a different model to generate words after person entities (e.g. John said) than generates words before location entities (e.g. in Boston). The tag transition constraints (B_X must be followed by M_X or E_X, etc.) propagate decisions, allowing a strong location-preceding word to trigger a location.
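The transition constraints can be written down as a small legality check. This is only a sketch over the plain BMEWO tags, with a single O tag standing in for all out tags; the rules beyond "B_X must be followed by M_X or E_X" are my reading of what the tag names imply, and the function names are hypothetical rather than LingPipe's API.

```python
def is_legal_transition(prev_tag, next_tag):
    """Check whether next_tag may directly follow prev_tag."""
    prev_kind, _, prev_type = prev_tag.partition("_")
    next_kind, _, next_type = next_tag.partition("_")
    if prev_kind in ("B", "M"):
        # An entity is still open: it must continue or end with the same type.
        return next_kind in ("M", "E") and next_type == prev_type
    # After E_X, W_X, or O no entity is open: stay out or start a new entity.
    return next_kind in ("O", "B", "W")


def is_legal_sequence(tags):
    """Check a whole tag sequence, including its first and last tags."""
    if not tags:
        return True
    if tags[0].partition("_")[0] not in ("O", "B", "W"):
        return False  # cannot start in the middle or at the end of an entity
    if tags[-1].partition("_")[0] not in ("O", "E", "W"):
        return False  # cannot stop with an entity still open
    return all(is_legal_transition(a, b) for a, b in zip(tags, tags[1:]))


print(is_legal_transition("B_LOC", "M_LOC"))
print(is_legal_transition("B_LOC", "O"))
print(is_legal_sequence(["W_PER", "O", "O", "B_LOC", "M_LOC", "E_LOC", "O"]))
```

A decoder that only searches over legal sequences is what lets a confident tag at one position force compatible tags on its neighbors.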
Note that the BMEWO+ encoding also adds a begin-of-sequence and end-of-sequence subcategorization to the out tags. This helped reduce the confusion between English sentence capitalization and proper name capitalization.
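To give a feel for the subdivided out tags, here is a hedged sketch that refines the O tags of a BMEWO sequence. The tag names (`O_X` for an out token right before an X entity, `X_O` for one right after, `BOS_O` and `EOS_O` for sequence boundaries) and the priority among them when several apply are assumptions for illustration; the exact tag set LingPipe uses internally may differ.

```python
def refine_out_tags(tags):
    """Subdivide plain O tags using the neighboring entity tags."""

    def entity_type(tag):
        # Entity type of a B_/M_/E_/W_ tag, or None for an out tag.
        return tag[2:] if tag[:2] in ("B_", "M_", "E_", "W_") else None

    refined = []
    last = len(tags) - 1
    for i, tag in enumerate(tags):
        if tag != "O":
            refined.append(tag)  # entity tags pass through unchanged
            continue
        next_type = entity_type(tags[i + 1]) if i < last else None
        prev_type = entity_type(tags[i - 1]) if i > 0 else None
        # Assumed priority: before-entity, then after-entity, then BOS/EOS.
        if next_type is not None:
            refined.append("O_" + next_type)   # e.g. "in" before a location
        elif prev_type is not None:
            refined.append(prev_type + "_O")   # e.g. "said" after a person
        elif i == 0:
            refined.append("BOS_O")            # sentence-initial out token
        elif i == last:
            refined.append("EOS_O")            # sentence-final out token
        else:
            refined.append("O")
    return refined


# Yesterday afternoon , John J . Smith traveled to Washington .
bmewo = ["O", "O", "O", "B_PER", "M_PER", "M_PER", "E_PER", "O", "O", "W_LOC", "O"]
print(refine_out_tags(bmewo))
```

With these tags, the sentence-initial "Yesterday" gets its own BOS_O tag, so its capitalization is modeled separately from that of tokens inside names.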
