http://www.cnblogs.com/undercurrent/p/4754944.html
I. The information extraction model
Information extraction proceeds in five steps; the raw input is an unprocessed string.
Step 1: sentence segmentation, done with nltk.sent_tokenize(text), which yields a list of strings.
Step 2: tokenization, done with [nltk.word_tokenize(sent) for sent in sentences], which yields a list of lists of strings.
Step 3: part-of-speech tagging, done with [nltk.pos_tag(sent) for sent in sentences], which yields a list of lists of tuples.
The first three steps can be wrapped in a single function:
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]
...     sentences = [nltk.pos_tag(sent) for sent in sentences]
...     return sentences
Step 4: entity detection. In this step, we need both to recognize the pre-def