[Natural Language Processing] Tagging Schemes: IO, BIO, BMEWO, and BMEWO+

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq280929090/article/details/79011660

IO Encoding

The simplest encoding is the IO encoding, which tags each token as either inside an entity of a particular type X (I_X) or outside any entity (O). This encoding is defective in that it cannot represent two adjacent entities of the same type, because there is no boundary tag.
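For concreteness, here is a minimal sketch of IO encoding as a Python helper (the function name and span format are illustrative, not from any particular toolkit); note how the boundary between two adjacent entities disappears:

```python
def io_encode(tokens, entities):
    """Tag each token I_X if it lies inside an entity of type X, else O.
    Entities are (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        for i in range(start, end):
            tags[i] = "I_" + etype
    return tags

tokens = ["John", "Smith", "Mary", "Jones", "met"]
# Two adjacent PER entities: [John Smith] and [Mary Jones]
print(io_encode(tokens, [(0, 2, "PER"), (2, 4, "PER")]))
# → ['I_PER', 'I_PER', 'I_PER', 'I_PER', 'O']  (the boundary is lost)
```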

BIO Encoding

The “industry standard” encoding is the BIO encoding (anyone know who invented this encoding?). It subdivides the in tags as either being begin-of-entity (B_X) or continuation-of-entity (I_X).
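The same sentence under BIO recovers the entity boundary, since each entity starts with its own B_X tag (again, a minimal illustrative sketch, not any toolkit's API):

```python
def bio_encode(tokens, entities):
    """Tag entity-initial tokens B_X, continuations I_X, the rest O.
    Entities are (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B_" + etype
        for i in range(start + 1, end):
            tags[i] = "I_" + etype
    return tags

tokens = ["John", "Smith", "Mary", "Jones", "met"]
print(bio_encode(tokens, [(0, 2, "PER"), (2, 4, "PER")]))
# → ['B_PER', 'I_PER', 'B_PER', 'I_PER', 'O']  (boundary recovered)
```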

BMEWO Encoding

The BMEWO encoding further distinguishes end-of-entity (E_X) tokens from mid-entity tokens (M_X), and adds a whole new tag for single-token entities (W_X). I believe the BMEWO encoding was introduced in Andrew Borthwick’s NYU thesis and related papers on “max entropy” named entity recognition around 1998, following Satoshi Sekine’s similar encoding for decision tree named entity recognition. (Satoshi and David Nadeau just released their Survey of NER.)
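A BMEWO encoder only needs one extra case on top of BIO: the single-token entity. A sketch (illustrative helper, same assumed span format as above):

```python
def bmewo_encode(tokens, entities):
    """Tag single-token entities W_X; multi-token entities get
    B_X (begin), M_X (middle), E_X (end); everything else O."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = "W_" + etype      # whole entity in one token
        else:
            tags[start] = "B_" + etype
            for i in range(start + 1, end - 1):
                tags[i] = "M_" + etype
            tags[end - 1] = "E_" + etype
    return tags

tokens = ["Yesterday", "John", "Smith", "visited", "Boston"]
print(bmewo_encode(tokens, [(1, 3, "PER"), (4, 5, "LOC")]))
# → ['O', 'B_PER', 'E_PER', 'O', 'W_LOC']
```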

BMEWO+ Encoding

I introduced the BMEWO+ encoding for the LingPipe HMM-based chunkers. Because of the conditional independence assumptions in HMMs, they can’t use information about preceding or following words. Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information. This allows a different model to generate words after person entities (e.g. John said), for example, than generates words before location entities (e.g. in Boston). The tag transition constraints (B_X must be followed by M_X or E_X, etc.) propagate decisions, allowing a strong location-preceding word to trigger a location.
Note that it also adds a begin and end of sequence subcategorization to the out tags. This helped reduce the confusion between English sentence capitalization and proper name capitalization.
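The idea of refining the out tags can be sketched as a post-pass over BMEWO tags. Note that this is only an illustration of the principle, not LingPipe's actual BMEWO+ tag inventory; the subcategory names (O_BOS, O_after_X, O_EOS) are invented here:

```python
def refine_out_tags(tags):
    """Illustrative BMEWO+-style refinement: subcategorize O tags by
    sequence position (begin/end of sequence) and by the entity type
    of the token they immediately follow."""
    refined = list(tags)
    for i, tag in enumerate(tags):
        if tag != "O":
            continue
        prev = tags[i - 1] if i > 0 else None
        if prev is None:
            refined[i] = "O_BOS"                     # sequence-initial out token
        elif prev.startswith(("E_", "W_")):
            refined[i] = "O_after_" + prev[2:]       # out token right after an entity
        elif i == len(tags) - 1:
            refined[i] = "O_EOS"                     # sequence-final out token
    return refined

# "Yesterday John Smith visited Boston" under BMEWO:
print(refine_out_tags(["O", "B_PER", "E_PER", "O", "W_LOC"]))
# → ['O_BOS', 'B_PER', 'E_PER', 'O_after_PER', 'W_LOC']
```

Here "visited" gets its own out subcategory because it follows a person entity, which is exactly the kind of distinction (John said vs. in Boston) described above.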
