"Python Data Mining: Concepts, Methods, and Practice", Chapter 6: Named Entity Recognition in Text

Latest recommended article published on 2024-03-21 17:08:33

水...琥珀


Category column: Python Natural Language

Article link: https://blog.csdn.net/shuihupo/article/details/79664898



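This article is a hands-on walkthrough of named entity recognition on English text. It makes substantial changes to the book's original code and adds comments to it.

The outline is as follows:

- Introduce the book's original code
- Change the input text
- Rework the code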

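Running the book's original code and sample text (the code for "Python Data Mining: Concepts, Methods, and Practice" is linked from the original post) ran into some problems. The book's code is as follows: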

import nltk
import pprint

# sample files that we use in this chapter
#filename = 'apacheMeetingMinutes.txt'
filename = 'djangoIRCchat.txt'
#filename = 'gnueIRCsummary.txt'
# filename = 'E:\Mypython\mypython\masteringDM-master\ch6\lkmlEmailsReduced.txt'

with open(filename, 'r', encoding='utf8') as sampleFile:
    text=sampleFile.read()

en = {}
try:   
    sent_detector = nltk.data.load('D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle')  # this is the path on my local machine
    sentences = sent_detector.tokenize(text.strip())
    
    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)
      
        # this was the original code in the book:
        '''
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
        '''
        # here is another way to write it that might be easier for new programmers
        for tree in chunked:
            if hasattr(tree, 'label'):
                ne = ''
                for c in tree.leaves():
                    ne += c[0] + ' '
                ne = ne.rstrip()
                en[ne] = tree.label()
    for key in en.keys():
        print(key, ':', en[key])
except Exception as e:
    print(str(e))
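
The inner loop relies on one detail: iterating over the result of `nltk.ne_chunk` yields plain `(word, tag)` tuples for ordinary tokens and subtree objects, which have a `label()` method, for recognized entities. The sketch below reproduces only that loop logic. `ChunkStub` is a hypothetical stand-in for an NLTK subtree (it is not part of NLTK), and the tags and labels are written by hand for illustration rather than taken from real tagger output, so the block runs without NLTK or its models.

```python
class ChunkStub:
    """Hypothetical stand-in for an NLTK entity subtree (not part of NLTK)."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return list(self._leaves)


def collect_entities(chunked):
    """Map each entity string to its label, mirroring the loop above."""
    entities = {}
    for node in chunked:
        # Plain tokens are (word, tag) tuples and have no label() method.
        if hasattr(node, 'label'):
            name = ' '.join(word for word, _tag in node.leaves())
            entities[name] = node.label()
    return entities


# Hand-written sample; the tags and the PERSON label are illustrative assumptions.
chunked = [
    ('Translated', 'VBN'),
    ('by', 'IN'),
    ChunkStub('PERSON', [('Xu', 'NNP'), ('Yuanchong', 'NNP')]),
]
entities = collect_entities(chunked)
for name, label in entities.items():
    print(name, ':', label)
```

Because the result is a dictionary keyed by the entity text, an entity that appears several times is stored once, and the label from its last occurrence wins.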

Running it does not produce the POS-tagged result. Instead, it prints:

<urlopen error unknown url type: d>

Process finished with exit code 0
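
The message above comes from the standard library's URL opener rather than from the tokenizer itself. A likely explanation, offered here as an assumption, is that `nltk.data.load` treats its argument as a resource URL, so the `D:` drive prefix is read as a URL scheme named `d`. The sketch below reproduces the same message with `urllib` alone and shows one way to turn a Windows path into a `file:` URL. Whether a given NLTK version accepts that URL should be checked against its documentation. The helper name `reproduce_error` is made up for this example.

```python
import urllib.error
import urllib.request
from pathlib import PureWindowsPath

# Path copied from the code above; it does not need to exist for this demo.
local_path = 'D:/local/Anaconda3/Lib/nltk-data//tokenizers/punkt/english.pickle'


def reproduce_error(resource):
    """Return the error text urlopen produces for an unknown URL scheme."""
    try:
        urllib.request.urlopen(resource)
    except urllib.error.URLError as err:
        return str(err)
    return None


# "D:" is parsed as the scheme "d", for which no handler exists.
message = reproduce_error(local_path)
print(message)

# An explicit file: URL removes the ambiguity about the drive letter.
file_url = PureWindowsPath(local_path).as_uri()
print(file_url)
```

The first line printed should match the error shown above. No network access is involved, since the opener fails before any connection is attempted.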

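After studying NLTK, I made the following changes. The earlier sample files can still be used as input, but here I use the English translation of the poem 《春晓》 ("A Spring Morning") by Mr. Xu Yuanchong, because a short text makes the results easier to show: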

A Spring Morning
Meng Haoran
Translated by Pr. Xu Yuanchong
This spring morning in bed I'm lying,
Not to awake till birds are crying.
After one night of wind and showers,
How many are the fallen flowers!
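
To feed this short text to the script, it can be saved as a UTF-8 file and read back the same way the book's code reads its sample files. A minimal sketch, assuming a hypothetical file name `spring_morning.txt` written to a temporary directory:

```python
import tempfile
from pathlib import Path

poem = """A Spring Morning
Meng Haoran
Translated by Pr. Xu Yuanchong
This spring morning in bed I'm lying,
Not to awake till birds are crying.
After one night of wind and showers,
How many are the fallen flowers!
"""

# Hypothetical file name; a temporary directory keeps the example self-contained.
filename = Path(tempfile.mkdtemp()) / 'spring_morning.txt'
filename.write_text(poem, encoding='utf8')

# Same reading pattern as the book's code.
with open(filename, 'r', encoding='utf8') as sampleFile:
    text = sampleFile.read()

print(len(text.strip().splitlines()), 'lines read')
```

Passing a path by name to `open` avoids the drive-letter problem entirely, since only NLTK resource loading interprets its argument as a URL.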
