"Python Data Mining: Concepts, Methods, and Practice" (《python 数据挖掘概念、方法与实践》), Chapter 6: Named Entity Recognition in Text

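This post is a hands-on tutorial on English named entity recognition. It substantially modifies and annotates the code from Chapter 6 of the book, including a rework of the code, and shows the results on a short text: Xu Yuanchong's English translation of "Spring Morning" (《春晓》).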

This post works through English named entity recognition in practice, with substantial changes to the book's code and added code comments.

The outline is as follows:

  • Introduce the book's original code
  • Change the text
  • Rework the code

I ran the book's original code and sample text (the code accompanying 《python 数据挖掘概念、方法与实践》) and ran into some problems. The original code is as follows:
import nltk
import pprint

# sample files that we use in this chapter
#filename = 'apacheMeetingMinutes.txt'
filename = 'djangoIRCchat.txt'
#filename = 'gnueIRCsummary.txt'
# filename = 'E:\Mypython\mypython\masteringDM-master\ch6\lkmlEmailsReduced.txt'

with open(filename, 'r', encoding='utf8') as sampleFile:
    text=sampleFile.read()

en = {}
try:   
    sent_detector = nltk.data.load('D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle')  # this is the path on my local machine
    sentences = sent_detector.tokenize(text.strip())
    
    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)
      
        # this was the original code in the book:
        '''
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
        '''
        # here is another way to write it that might be easier for new programmers
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ''
                for c in tree.leaves():
                    ne += c[0] + ' '
                ne = ne.rstrip()
                en[ne] = tree.label()
    for key in en.keys():
        print(key, ':', en[key])
except Exception as e:
    print(str(e))

After running it, the POS-tagged result is not produced. The output is:

<urlopen error unknown url type: d>

Process finished with exit code 0
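
The error message points to a likely cause: a Windows path such as `D:/...` starts with a drive letter and a colon, and URL parsing reads that as a scheme named `d`. `nltk.data.load` appears to treat its argument as a resource URL and to hand unrecognized schemes to `urllib`, which has no handler for `d`. The sketch below reproduces the same message using only the standard library. The helper names `url_scheme` and `try_open` are illustrative, and the `file:` prefix is one possible workaround, not necessarily the fix the author applied.

```python
from urllib.error import URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

# The path passed to nltk.data.load in the book's code (copied from above).
WINDOWS_PATH = 'D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle'


def url_scheme(resource):
    """Return the scheme that URL parsing assigns to a resource string."""
    return urlsplit(resource).scheme


def try_open(resource):
    """Try to open the resource as a URL; return the error text, or None."""
    try:
        urlopen(resource).close()
    except URLError as err:
        return str(err)
    return None


# The drive letter is read as a URL scheme, so no handler can open it.
print(url_scheme(WINDOWS_PATH))   # scheme parsed from the bare Windows path
print(try_open(WINDOWS_PATH))     # error text raised by urllib

# Prefixing the path with "file:" makes the scheme explicit (illustrative).
print(url_scheme('file:' + WINDOWS_PATH))
```

In practice, calling `nltk.sent_tokenize(text)` avoids handling the pickle path by hand, provided the NLTK tokenizer data is installed.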

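After studying NLTK, I made the following changes. The earlier text files would still work, but here I use Mr. Xu Yuanchong's English translation of "Spring Morning" (《春晓》), since a short text makes the results easy to show: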

A Spring Morning
Meng Haoran
Translated by Pr. Xu Yuanchong
This spring morning in bed I'm lying,
Not to awake till birds are crying.
After one night of wind and showers,
How many are the fallen flowers!
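
A minimal sketch of how the pipeline could be run on this poem without a hard-coded pickle path, assuming NLTK and its tokenizer, tagger, and NE-chunker data are installed (the data package names vary by NLTK version). The names `POEM`, `extract_entities`, and `recognize_entities` are illustrative and are not taken from the book; the extraction loop follows the same logic as the book's code above.

```python
POEM = """A Spring Morning
Meng Haoran
Translated by Pr. Xu Yuanchong
This spring morning in bed I'm lying,
Not to awake till birds are crying.
After one night of wind and showers,
How many are the fallen flowers!"""


def extract_entities(chunked):
    """Map each named-entity phrase in one chunked sentence to its label."""
    entities = {}
    for node in chunked:
        # Plain (word, tag) tuples have no label(); entity subtrees do.
        if hasattr(node, 'label'):
            phrase = ' '.join(word for word, _tag in node.leaves())
            entities[phrase] = node.label()
    return entities


def recognize_entities(text):
    """Run sentence splitting, tokenizing, POS tagging, and NE chunking."""
    import nltk  # imported here so extract_entities works without NLTK

    entities = {}
    for sentence in nltk.sent_tokenize(text.strip()):
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)
        chunked = nltk.ne_chunk(tagged)
        entities.update(extract_entities(chunked))
    return entities
```

Calling `recognize_entities(POEM)` returns a dictionary from entity phrase to entity label, which can be printed with the same `for key in en.keys()` loop used in the book's code. Which phrases are recognized depends on the installed NLTK models.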



