1. Exploiting Context
Contextual features often provide powerful clues about the correct tag. For example, when tagging the word "fly", knowing that the preceding word is "a" lets us determine that it is a noun rather than a verb, whereas a preceding "to" clearly marks it as a verb. So the part-of-speech classifier we build here uses a feature extractor that examines the context in which a word appears in order to decide which tag to assign. In particular, the preceding word is used as a feature.
>>> import nltk
>>> from nltk.corpus import brown
>>> def pos_features(sentence, i):
...     features = {"suffix(1)": sentence[i][-1:],
...                 "suffix(2)": sentence[i][-2:],
...                 "suffix(3)": sentence[i][-3:]}
...     if i == 0:
...         features["prev-word"] = "<START>"
...     else:
...         features["prev-word"] = sentence[i-1]
...     return features
>>> print(brown.sents()[0])
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
>>> pos_features(brown.sents()[0], 8)
{'suffix(3)': 'ion', 'prev-word': 'an', 'suffix(2)': 'on', 'suffix(1)': 'n'}
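The prev-word feature is exactly what resolves the "fly" ambiguity described at the start of this section. A minimal self-contained check (re-stating the extractor's logic on plain lists of strings, so no corpus is needed):

```python
def pos_features(sentence, i):
    """Suffix features plus the previous word (same logic as above)."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"  # sentence-initial sentinel
    else:
        features["prev-word"] = sentence[i - 1]
    return features

# The same word "fly" yields different feature dicts in the two contexts,
# which is what lets the classifier choose between noun and verb:
print(pos_features(["a", "fly"], 1))    # prev-word: 'a'
print(pos_features(["to", "fly"], 1))   # prev-word: 'to'
print(pos_features(["Fly", "now"], 0))  # prev-word: '<START>'
```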
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> featuresets = []
>>> for tagged_sent in tagged_sents:
...     untagged_sent = nltk.tag.untag(tagged_sent)
...     for i, (word, tag) in enumerate(tagged_sent):
...         featuresets.append((pos_features(untagged_sent, i), tag))
...
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.7891596220785678
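To make the featureset-building loop above concrete without loading the corpus, here is the same construction run on a single hand-tagged toy sentence (the tags are illustrative Brown-style tags, not real corpus data; `nltk.tag.untag` simply strips the tags, which the list comprehension below mimics):

```python
def pos_features(sentence, i):
    """Suffix features plus the previous word (same logic as above)."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features

# One hand-tagged sentence (toy data, not taken from the Brown corpus).
tagged_sent = [("The", "AT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")]

featuresets = []
untagged_sent = [word for word, tag in tagged_sent]  # what nltk.tag.untag does
for i, (word, tag) in enumerate(tagged_sent):
    featuresets.append((pos_features(untagged_sent, i), tag))

# Each item pairs a feature dict with its gold tag, ready to feed to
# nltk.NaiveBayesClassifier.train exactly as in the session above.
print(featuresets[1])
```

The 90/10 split in the session then simply holds out the first 10% of these (features, tag) pairs for evaluation and trains on the rest.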
2. Sequence Classification
(I have not understood this part well yet; skipping it for now.)