2. Corpus annotation
import nltk
Stemming
porter = nltk.PorterStemmer()
porter.stem('lying')
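As a quick check of what the stemmer returns, the sketch below runs the same PorterStemmer over a few inflected forms. The word list is an illustrative assumption, not part of the original notes; the stemmer itself needs no downloaded NLTK data.

```python
import nltk

porter = nltk.PorterStemmer()

# Illustrative word list (an assumption for this sketch).
words = ['lying', 'dying', 'running']

# Map each word to its stem.
stems = {w: porter.stem(w) for w in words}
for word, stem in stems.items():
    print(word, '->', stem)
```

Note that a stem is not always a dictionary word; the stemmer only strips and rewrites suffixes by rule.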
Part-of-speech tagger
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
Manual tagging
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
Chinese support
sent = '我/NN 是/IN 一个/AT 学习者/NN'
[nltk.tag.str2tuple(t) for t in sent.split()]
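The round trip between "word/TAG" strings and (word, tag) tuples can be sketched as follows. It reuses the sentence above and adds `nltk.tag.tuple2str`, the inverse helper, plus two edge cases; the edge-case inputs are assumptions chosen for illustration.

```python
import nltk

sent = '我/NN 是/IN 一个/AT 学习者/NN'

# Parse each "word/TAG" token into a (word, tag) tuple.
tagged = [nltk.tag.str2tuple(t) for t in sent.split()]
print(tagged)

# tuple2str is the inverse: join the tuples back into "word/TAG" form.
rebuilt = ' '.join(nltk.tag.tuple2str(pair) for pair in tagged)
print(rebuilt)

# Edge cases: the tag is upper-cased, and a token without a
# separator gets the tag None.
lower_case = nltk.tag.str2tuple('fly/nn')
no_tag = nltk.tag.str2tuple('fly')
print(lower_case, no_tag)
```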
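The Brown Corpus was the first machine-readable balanced corpus. It contains 500+ samples of running written English, each with 2,000+ words, for about 1,014,300 words in total.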
nltk.corpus.brown.tagged_words()
# Sinica Treebank: a tagged Chinese corpus shipped with NLTK
for word in nltk.corpus.sinica_treebank.tagged_words():
    print(word[0], word[1])
Automatic POS tagging: tag everything as a noun by default (about 13% accuracy)
default_tagger = nltk.DefaultTagger('NN')
raw = '我 累 个 去'
tokens = nltk.word_tokenize(raw)
tags = default_tagger.tag(tokens)
print(tags)
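To see where an accuracy figure for a default tagger comes from, here is a minimal sketch that scores `DefaultTagger('NN')` against a tiny hand-made gold standard. The gold sentence is an assumption for illustration and has nothing to do with the 13% figure; accuracy is computed by hand so the sketch does not depend on a particular NLTK version's scoring method.

```python
import nltk

# Hand-made gold-standard sentence (an assumption for this sketch).
gold = [('我', 'NN'), ('是', 'IN'), ('一个', 'AT'), ('学习者', 'NN')]

default_tagger = nltk.DefaultTagger('NN')
tokens = [word for word, _ in gold]
predicted = default_tagger.tag(tokens)

# Accuracy = share of tokens whose predicted tag equals the gold tag.
correct = sum(1 for p, g in zip(predicted, gold) if p == g)
accuracy = correct / len(gold)
print(predicted)
print(accuracy)
```

On a real corpus the same calculation would run over all of its tagged sentences, and the score depends on how often the default tag actually occurs there.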
Regular-expression tagging
pattern = [(r'.*们$','PRO')]
tagger = nltk.RegexpTagger(pattern)
print(tagger.tag(nltk.word_tokenize('我们 累 个 去 你们 和 他们 啊')))
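A regular-expression tagger leaves tokens that match no pattern tagged as `None`. One common way to handle that, sketched here, is to pass a `DefaultTagger` as the `backoff` argument. The sketch splits on whitespace instead of calling `word_tokenize`, an assumption made so that it runs without downloaded tokenizer data.

```python
import nltk

patterns = [(r'.*们$', 'PRO')]  # words ending in 们 are tagged as pronouns
tokens = '我们 累 个 去 你们 和 他们 啊'.split()

# Without a backoff, unmatched tokens are tagged None.
plain = nltk.RegexpTagger(patterns).tag(tokens)
print(plain)

# With a DefaultTagger backoff, unmatched tokens fall back to 'NN'.
backoff = nltk.DefaultTagger('NN')
with_backoff = nltk.RegexpTagger(patterns, backoff=backoff).tag(tokens)
print(with_backoff)
```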
Lookup tagger: unigram tagging
from nltk.corpus import brown
tagged_sents = [[(u'我', u'PRO'), (u'小兔', u'NN')]]
unigram_tagger = nltk.UnigramTagger(tagged_sents)  # tagged_sents can be swapped for another tagged corpus
# sents = brown.sents(categories='news')  # alternative input: sentences from the Brown news category
sents = [[u'我', u'你', u'小兔']]
tags = unigram_tagger.tag(sents[0])
print(tags)
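A unigram tagger only knows the words it saw during training, so an unseen word such as 你 comes back tagged `None`. The sketch below contrasts that with a tagger that backs off to `DefaultTagger('NN')`; the choice of 'NN' as the fallback tag is an assumption for illustration.

```python
import nltk

tagged_sents = [[('我', 'PRO'), ('小兔', 'NN')]]
sentence = ['我', '你', '小兔']

# A plain unigram tagger returns None for words absent from training data.
unigram_tagger = nltk.UnigramTagger(tagged_sents)
plain = unigram_tagger.tag(sentence)
print(plain)

# Chaining a DefaultTagger as backoff gives unseen words a fallback tag.
backoff_tagger = nltk.UnigramTagger(tagged_sents, backoff=nltk.DefaultTagger('NN'))
with_backoff = backoff_tagger.tag(sentence)
print(with_backoff)
```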
Saving and loading a tagger
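from pickle import dump, load  # cPickle exists only in Python 2; pickle replaces it in Python 3
# t2 is assumed to be an already trained tagger (for example unigram_tagger above)
with open('t2.pkl', 'wb') as output:
    dump(t2, output, -1)  # -1 selects the highest pickle protocol
with open('t2.pkl', 'rb') as infile:
    tagger = load(infile)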
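The save/load round trip can be sketched end to end as below. Since the notes do not show how `t2` was built, the sketch assumes a small unigram tagger as a stand-in, and it writes to a temporary directory (also an assumption) so that nothing is left on disk.

```python
import os
import pickle
import tempfile

import nltk

# Stand-in for t2: a small unigram tagger (an assumption for this sketch).
t2 = nltk.UnigramTagger([[('我', 'PRO'), ('小兔', 'NN')]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 't2.pkl')

    # Save the trained tagger with the highest available pickle protocol.
    with open(path, 'wb') as output:
        pickle.dump(t2, output, pickle.HIGHEST_PROTOCOL)

    # Load it back into a new object.
    with open(path, 'rb') as infile:
        tagger = pickle.load(infile)

# The restored tagger should tag exactly like the original.
restored = tagger.tag(['我', '小兔'])
print(restored)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.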