Named entity recognition with Python NLTK: extracting person names from a string

Environment setup:

1. cd into the project folder, open a cmd prompt, and configure the Python environment.
Run:
pip install nltk
2. Install a JDK to set up the Java environment.
JDK installer link: https://pan.baidu.com/s/1TTSVMrDZZ74gbjbgrUz4xw  extraction code: mlnv
To configure the environment variables, see: https://blog.csdn.net/qq_16085405/article/details/80700804
3. Download stanford-ner-2018-10-16 to set up the name-recognition environment.
Download link: https://pan.baidu.com/s/1FhM4ZSORNSPcncci7uzz2g  extraction code: nhyw
After unzipping the Stanford NER zip file, the folder path used below is: E://stanford-ner-2018-10-16
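
Since the steps above require both a working Python package and a reachable Java runtime, a quick stdlib-only sanity check can save debugging time later. This is a sketch: `java_available` is a hypothetical helper written for this post, not part of NLTK; the JAVAHOME fallback mirrors the environment variable the code below sets.

```python
import os
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is visible, either on PATH
    or at the JAVAHOME path that the NER code below sets."""
    if shutil.which('java'):
        return True
    # Fall back to the JAVAHOME variable used later in check_name.
    return os.path.isfile(os.environ.get('JAVAHOME', ''))


print(java_available())
```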

Source code

First, thanks to the original author; reference article: link
The author's code is pasted below (Stanford NER performs better than NLTK's built-in NER; see the original article for details):

import re
import os

import nltk
from nltk.tag import StanfordNERTagger


class check(object):

    @staticmethod
    def parse_document(document):
        # Normalize newlines, then split the text into sentences.
        if not isinstance(document, str):
            raise ValueError('Document is not a string!')
        document = re.sub(r'\n', ' ', document).strip()
        sentences = nltk.sent_tokenize(document)
        return [sentence.strip() for sentence in sentences]


    @staticmethod
    def check_name(article_content):
        sentences = check.parse_document(article_content)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

        # Point NLTK at the Java runtime (Stanford NER runs on the JVM).
        java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
        os.environ['JAVAHOME'] = java_path
        # Load the Stanford NER model and jar.
        sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                               path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

        # Tag the sentences -- this classification step is the heart of the method.
        ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

        # Extract named entities: collapse runs of consecutive non-'O' tokens.
        named_entities = []
        for sentence in ne_annotated_sentences:
            temp_entity_name = ''
            temp_named_entity = None
            for term, tag in sentence:
                # Keep terms that carry an NE tag.
                if tag != 'O':
                    temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # accumulate the entity name
                    temp_named_entity = (temp_entity_name, tag)  # the entity and its category
                else:
                    if temp_named_entity:
                        named_entities.append(temp_named_entity)
                        temp_entity_name = ''
                        temp_named_entity = None

        # Deduplicate; named_entities is the recognition result.
        named_entities = list(set(named_entities))
        # Keep only the PERSON entities.
        name = []
        for n in named_entities:
            if n[1] == 'PERSON':
                name.append(n[0])
        return name

ttt = 'The case was prosecuted by Trial Attorney Joseph Palazzo of the Money Laundering and Asset Recovery Section and Assistant U.S. Attorneys Thomas A. Gillice, Luke Jones, Karen Seifert and Deborah Curtis and Special Assistant U.S. Attorney Jacqueline L. Barkett of the U.S. Attorney’s Office for the District of Columbia.'
name = check.check_name(ttt)
print(name)
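
The entity-grouping loop inside check_name merges each run of consecutive non-'O' tokens into a single entity. Isolated below with hand-written tags so it can be tested without Java or the Stanford model; note that this standalone version also flushes a trailing entity after the loop, which the version above silently drops when a sentence ends on a tagged token.

```python
def group_entities(tagged_tokens):
    """Collapse runs of consecutive non-'O' tokens into (name, tag)
    pairs, mirroring the loop inside check_name."""
    entities = []
    name = ''
    current = None
    for term, tag in tagged_tokens:
        if tag != 'O':
            name = (name + ' ' + term).strip()
            current = (name, tag)
        else:
            if current:
                entities.append(current)
            name = ''
            current = None
    if current:  # flush a trailing entity (check_name drops this case)
        entities.append(current)
    return entities


tagged = [('Joseph', 'PERSON'), ('Palazzo', 'PERSON'), ('of', 'O'),
          ('the', 'O'), ('District', 'LOCATION'), ('of', 'LOCATION'),
          ('Columbia', 'LOCATION')]
print(group_entities(tagged))
# → [('Joseph Palazzo', 'PERSON'), ('District of Columbia', 'LOCATION')]
```

One caveat this makes visible: two different entities with no 'O' token between them would be merged into one, since only an 'O' tag closes the current entity.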

Debug

1. If running it fails with the error "NLTK was unable to find the java file! Use software specific configuration parameters or set the JAVAHOME environment variable",
see: https://blog.csdn.net/LIUSHAO123456789/article/details/79486997
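
In practice this error means NLTK cannot locate the Java binary. Setting JAVAHOME to the full path of java.exe before constructing StanfordNERTagger, as check_name already does, resolves it. A minimal sketch; the path below is an example from this post, so adjust it to your own JDK install:

```python
import os

# Example path -- point this at your own JDK's java.exe.
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path  # must be set before StanfordNERTagger is created
print(os.environ['JAVAHOME'])
```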
