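This article is a hands-on walkthrough of named entity recognition for English text. It makes substantial changes to the original book's code and adds comments to it.
The outline is as follows:
- Introduce the original book's code
- Change the input text
- Rework the code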
I worked through the code and text that accompany the book 《Python 数据挖掘概念、方法与实践》 and ran into some problems. The original code from the book is as follows:
import nltk
import pprint
# sample files that we use in this chapter
#filename = 'apacheMeetingMinutes.txt'
filename = 'djangoIRCchat.txt'
#filename = 'gnueIRCsummary.txt'
# filename = 'E:\Mypython\mypython\masteringDM-master\ch6\lkmlEmailsReduced.txt'
with open(filename, 'r', encoding='utf8') as sampleFile:
    text = sampleFile.read()
en = {}
try:
    # this path is the location on my local machine
    sent_detector = nltk.data.load('D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(text.strip())
    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)
        # this was the original code in the book:
        '''
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
        '''
        # here is another way to write it that might be easier for new programmers
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ''
                for c in tree.leaves():
                    ne += c[0] + ' '
                ne = ne.rstrip()
                en[ne] = tree.label()
    for key in en.keys():
        print(key, ':', en[key])
except Exception as e:
    print(str(e))
Running it does not print the tagged entities. Instead, the output is:
<urlopen error unknown url type: d>
Process finished with exit code 0
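The message suggests that the path string was handled as a URL, with the Windows drive letter `D:` read as the URL scheme `d`. The sketch below reproduces that behaviour with the standard library only, so it runs without NLTK or a D: drive. The helper `as_file_url` is a hypothetical name of mine, and the idea that NLTK accepts a `file:` prefix for local resources is my understanding of `nltk.data.load`, not something the book states.

```python
from urllib.error import URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

# The path used in the listing above (a location on the author's machine).
local_path = 'D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle'

# A drive letter followed by ':' parses exactly like a URL scheme.
scheme = urlsplit(local_path).scheme
print(scheme)

# Passing such a string to a URL opener fails before any file is touched.
try:
    urlopen(local_path)
    message = 'no error'
except URLError as e:
    message = str(e)
print(message)


def as_file_url(path):
    """Hypothetical helper: mark a local path explicitly as a file resource."""
    return 'file:' + path


print(as_file_url(local_path))
# With NLTK installed, one could then try (assumption, not executed here):
# sent_detector = nltk.data.load(as_file_url(local_path))
```

Another option that avoids hard-coded paths altogether is to let NLTK look the tokenizer up in its own data directories, for example by calling `nltk.sent_tokenize(text)` after downloading the tokenizer data.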
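After studying NLTK, I made the following changes. The earlier text files still work, but here I use Xu Yuanchong's English translation of 《春晓》, because a short text makes the results easier to show: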
A Spring Morning
Meng Haoran
Translated by Pr. Xu Yuanchong
This spring morning in bed I'm lying,
Not to awake till birds are crying.
After one night of wind and showers,
How many are the fallen flowers!
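To make the entity-collection loop from the book's listing concrete for a short text like this one, here is a minimal sketch that walks a chunk result and keeps only the labelled subtrees. The chunk result is written by hand for the line "Translated by Pr. Xu Yuanchong": it is not real `nltk.ne_chunk` output, and the tags and the PERSON label are assumptions. `StubTree` and `collect_entities` are hypothetical names; `StubTree` only imitates the `label()` and `leaves()` methods that the loop relies on, so the example runs without NLTK.

```python
class StubTree:
    """Hypothetical stand-in for nltk.Tree: only label() and leaves()."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = list(leaves)

    def label(self):
        return self._label

    def leaves(self):
        return self._leaves


# Hand-written chunk result (assumed, NOT produced by nltk.ne_chunk).
# Plain tokens are (word, tag) tuples; entities are labelled subtrees.
chunked = [
    ('Translated', 'VBN'),
    ('by', 'IN'),
    ('Pr.', 'NNP'),
    StubTree('PERSON', [('Xu', 'NNP'), ('Yuanchong', 'NNP')]),
]


def collect_entities(chunked):
    """Map each entity string to its label, as the listing's loop does."""
    entities = {}
    for node in chunked:
        # Tuples have no 'label' attribute, so only subtrees pass this check.
        if hasattr(node, 'label'):
            name = ' '.join(word for word, tag in node.leaves())
            entities[name] = node.label()
    return entities


entities = collect_entities(chunked)
for name, label in entities.items():
    print(name, ':', label)  # prints "Xu Yuanchong : PERSON"
```

Because the dictionary is keyed by the entity string, an entity that appears several times is stored once, with the label from its last occurrence.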