Named entity recognition with Python NLTK: extracting person names from a string

Environment setup:

1. cd into the project folder, open a cmd prompt, and configure the Python environment.
Run:
pip install nltk
2. Install a JDK to set up the Java environment.
JDK installer link: https://pan.baidu.com/s/1TTSVMrDZZ74gbjbgrUz4xw  extraction code: mlnv
To configure the environment variables, see: https://blog.csdn.net/qq_16085405/article/details/80700804
3. Download stanford-ner-2018-10-16 to set up the name-recognition environment.
Download link: https://pan.baidu.com/s/1FhM4ZSORNSPcncci7uzz2g  extraction code: nhyw
After unzipping the Stanford NER zip file, the folder path used below is: E://stanford-ner-2018-10-16
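
Since the steps above require both a working Python package and a reachable Java runtime, a quick stdlib-only sanity check can save debugging time later. This is a sketch: `java_available` is a hypothetical helper written for this post, not part of NLTK; the JAVAHOME fallback mirrors the environment variable the code below sets.

```python
import os
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is visible, either on PATH
    or at the JAVAHOME path that the NER code below sets."""
    if shutil.which('java'):
        return True
    # Fall back to the JAVAHOME variable used later in check_name.
    return os.path.isfile(os.environ.get('JAVAHOME', ''))


print(java_available())
```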

Source code

First, thanks to the original author; reference article: link
The author's code is pasted below (Stanford NER performs better than NLTK's built-in NER; see the original article for details):

import re
import os

import nltk
from nltk.tag import StanfordNERTagger


class check(object):

    @staticmethod
    def parse_document(document):
        # Normalize newlines, then split the text into sentences.
        if not isinstance(document, str):
            raise ValueError('Document is not a string!')
        document = re.sub(r'\n', ' ', document).strip()
        sentences = nltk.sent_tokenize(document)
        return [sentence.strip() for sentence in sentences]


    @staticmethod
    def check_name(article_content):
        sentences = check.parse_document(article_content)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

        # Point NLTK at the Java runtime (Stanford NER runs on the JVM).
        java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
        os.environ['JAVAHOME'] = java_path
        # Load the Stanford NER model and jar.
        sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                               path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

        # Tag the sentences -- this classification step is the heart of the method.
        ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

        # Extract named entities: collapse runs of consecutive non-'O' tokens.
        named_entities = []
        for sentence in ne_annotated_sentences:
            temp_entity_name = ''
            temp_named_entity = None
            for term, tag in sentence:
                # Keep terms that carry an NE tag.
                if tag != 'O':
                    temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # accumulate the entity name
                    temp_named_entity = (temp_entity_name, tag)  # the entity and its category
                else:
                    if temp_named_entity:
                        named_entities.append(temp_named_entity)
                        temp_entity_name = ''
                        temp_named_entity = None

        # Deduplicate; named_entities is the recognition result.
        named_entities = list(set(named_entities))
        # Keep only the PERSON entities.
        name = []
        for n in named_entities:
            if n[1] == 'PERSON':
                name.append(n[0])
        return name

ttt = 'The case was prosecuted by Trial Attorney Joseph Palazzo of the Money Laundering and Asset Recovery Section and Assistant U.S. Attorneys Thomas A. Gillice, Luke Jones, Karen Seifert and Deborah Curtis and Special Assistant U.S. Attorney Jacqueline L. Barkett of the U.S. Attorney’s Office for the District of Columbia.'
name = check.check_name(ttt)
print(name)
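
The entity-grouping loop inside check_name merges each run of consecutive non-'O' tokens into a single entity. Isolated below with hand-written tags so it can be tested without Java or the Stanford model; note that this standalone version also flushes a trailing entity after the loop, which the version above silently drops when a sentence ends on a tagged token.

```python
def group_entities(tagged_tokens):
    """Collapse runs of consecutive non-'O' tokens into (name, tag)
    pairs, mirroring the loop inside check_name."""
    entities = []
    name = ''
    current = None
    for term, tag in tagged_tokens:
        if tag != 'O':
            name = (name + ' ' + term).strip()
            current = (name, tag)
        else:
            if current:
                entities.append(current)
            name = ''
            current = None
    if current:  # flush a trailing entity (check_name drops this case)
        entities.append(current)
    return entities


tagged = [('Joseph', 'PERSON'), ('Palazzo', 'PERSON'), ('of', 'O'),
          ('the', 'O'), ('District', 'LOCATION'), ('of', 'LOCATION'),
          ('Columbia', 'LOCATION')]
print(group_entities(tagged))
# → [('Joseph Palazzo', 'PERSON'), ('District of Columbia', 'LOCATION')]
```

One caveat this makes visible: two different entities with no 'O' token between them would be merged into one, since only an 'O' tag closes the current entity.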

Debug

1. If running it fails with the error "NLTK was unable to find the java file! Use software specific configuration parameters or set the JAVAHOME environment variable",
see: https://blog.csdn.net/LIUSHAO123456789/article/details/79486997
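
In practice this error means NLTK cannot locate the Java binary. Setting JAVAHOME to the full path of java.exe before constructing StanfordNERTagger, as check_name already does, resolves it. A minimal sketch; the path below is an example from this post, so adjust it to your own JDK install:

```python
import os

# Example path -- point this at your own JDK's java.exe.
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path  # must be set before StanfordNERTagger is created
print(os.environ['JAVAHOME'])
```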
