Named Entity Recognition with Python Natural Language Processing

Building and evaluating an NER system:

1. Split the document into sentences

2. Split each sentence into words

3. Tag each word with its part of speech

4. Identify named entities among the tagged words

5. Classify each named entity

6. Evaluate the results
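The last step, evaluation, is not covered by the code below, so here is a minimal sketch of entity-level precision/recall/F1. The `entity_prf` helper and the sample gold/predicted pairs are illustrative, not part of NLTK:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 over (text, label) pairs."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # entities predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Mexico", "GPE"), ("Losano", "PERSON"), ("World Cup", "EVENT")]
pred = [("Mexico", "GPE"), ("Losano", "PERSON"), ("German", "GPE")]
print(entity_prf(gold, pred))  # precision, recall and F1 are each 2/3 here
```

Real evaluations (e.g. CoNLL scoring) additionally distinguish exact-span matches from partial overlaps, but the set intersection above captures the core idea.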

NLTK (Natural Language Toolkit) is one of the most widely used Python libraries in NLP. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Official documentation: http://www.nltk.org
Installing and testing NLTK with Python: https://blog.csdn.net/shuihupo/article/details/79635044
Calling the Baidu NLP API: https://blog.csdn.net/m0_37788308/article/details/79994499

Named entity recognition with NLTK:

Read in the English text data

# -*- coding: utf-8 -*- 
import nltk  
import pprint
# filename = "test.txt"
# with open(filename, 'r', encoding='utf8') as sampleFile:
#     text=sampleFile.read()

To avoid file-path problems when the code is moved, we store the text in a variable instead:

text = "Mexico quakes with joy over World Cup upset win.Mexico’s Earthquake Early Warning and Monitoring System issued a message on the 17th that the Mexican team played against the German team in the World Cup. During the first half of the game until the 35th minute, the Mexican team striker Losano broke the deadlock and scored the first goal, scoring a goal in Mexico. The city monitored minor earthquakes. This monitoring system analyzes that the earthquake was caused by man-made methods or caused by many people excitedly jumping when scoring."
en = {} 

Steps 1-3: split the document into sentences, split the sentences into words, and tag each word with its part of speech.

tokenized = nltk.word_tokenize(text)     # tokenization
# pprint.pprint(tokenized)
tagged = nltk.pos_tag(tokenized)         # part-of-speech tagging
# pprint.pprint(tagged)
chunked = nltk.ne_chunk(tagged)          # named entity recognition

NN — noun, singular or mass: year, home, costs, time, education

NNS — noun, plural: undergraduates, scotches

NNP — proper noun, singular: Alison, Africa, April, Washington

NNPS — proper noun, plural: Americans, Americas, Amharas, Amityvilles
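These tags come from the Penn Treebank tagset that `nltk.pos_tag` uses by default. The lookup table and helper below are hand-written for illustration (not an NLTK API); the `NNP*` check matters because proper nouns are the main candidates for named entities:

```python
# Hand-written subset of the Penn Treebank noun tags (for illustration only)
NOUN_TAGS = {
    "NN":   "noun, singular or mass",
    "NNS":  "noun, plural",
    "NNP":  "proper noun, singular",
    "NNPS": "proper noun, plural",
}

def is_proper_noun(tag):
    # Proper-noun tags start with 'NNP'; these are the usual NER candidates
    return tag.startswith("NNP")

print(is_proper_noun("NNP"), is_proper_noun("NN"))  # True False
```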

# pprint.pprint(chunked)  # chunked is of type <class 'nltk.tree.Tree'>
chunked.draw()  # opens a window rendering the tree; draw() returns None, so wrapping it in print() just prints None
Take care when processing the tree; observe its structure first.
for tree in chunked:
    # print(tree)
    # print(type(tree))  # non-entity tokens are plain tuples; named entities are Tree objects
    if hasattr(tree, 'label'):
        # tree.draw()
        ne = ' '.join(c[0] for c in tree.leaves())
        en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
for key in en.keys():
    print(key, ':', en[key])
Mexican : ['GPE', 'NNP']
Early Warning : ['PERSON', 'JJ NNP']
German : ['GPE', 'JJ']
Monitoring System : ['ORGANIZATION', 'NNP NNP']
Mexico : ['GPE', 'NNP']
Losano : ['PERSON', 'NNP']

Implementation with the Baidu API

textnew= "世界杯爆冷门,墨西哥球迷激动跳跃引发首都墨西哥城地震!墨西哥地震预警监控系统17日发布消息,当天墨西哥队在对阵德国队的世界杯比赛中,上半场比赛进行至第35分钟时,墨西哥队前锋洛萨诺打破僵局攻入首球,进球时墨西哥城监测到轻微地震。这一监控系统分析说,这次地震是由人为方式引发,或因进球时许多民众激动跳跃造成。"
# -*- coding: utf-8 -*-
import urllib3
import json
import pprint

Step 1: obtain an access_token.
client_id is the API Key (AK) from the Baidu console; client_secret is the Secret Key (SK).

access_token ="24.340837cdde292a61442507b60e6fb64c.2592000.1532060546.282335-11012308"

Step 2: call the API with a POST request, passing in the parameters.

import sys
print(sys.getdefaultencoding())
http=urllib3.PoolManager()
url = "https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer?access_token="+access_token
print(url)
data ={
  "text":textnew}
utf-8
https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer?access_token=24.340837cdde292a61442507b60e6fb64c.2592000.1532060546.282335-11012308
encode_data = json.dumps(data).encode('GBK')  # the payload is a dict; it must be serialized and encoded before sending
# JSON: pre-serialized JSON data can be sent by setting the body parameter and the Content-Type header:
request = http.request('POST',
                       url,
                       body=encode_data,
                       headers={
  'Content-Type':'application/json'}
                       )
result = str(request.data,'GBK')
D:\local\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
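A side note on the `'GBK'` encoding used above: with the default `ensure_ascii=True`, `json.dumps` escapes Chinese characters to `\uXXXX`, so the encoded bytes are ASCII-safe either way. The standalone round trip below (independent of the API call) shows that the text survives serialization:

```python
import json

payload = {"text": "世界杯爆冷门"}
body = json.dumps(payload).encode("GBK")   # \uXXXX escapes make the bytes ASCII-safe
restored = json.loads(body.decode("GBK"))  # decode and parse the round trip
print(restored["text"] == payload["text"])  # True
```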
result_dict = json.loads(result)  # parse the JSON response; json.loads is safer than eval
pprint.pprint(result_dict)
{'items': [{'basic_words': ['世界', '杯'],
            'byte_length': 6,
            'byte_offset': 0,
            'formal': '',
            'item': '世界杯',
            'loc_details': [],
            'ne': '',
            'pos': 'nz',
            'uri': ''},
           {'basic_words': ['爆冷', '门'],
            'byte_length': 6,
            'byte_offset': 6,
            'formal': '',
            'item': '爆冷门',
            'loc_details': [],
            'ne': '',
            'pos': 'nz',
            'uri': ''},
           {'basic_words': [','],
            'byte_length': 2,
            'byte_offset': 12,
            'formal': '',
            'item': ',',
            'loc_details': [],
            'ne': '',
            'pos': 'w',
            'uri': ''},
           {'basic_words': ['墨西哥'],
            'byte_length': 6,
            ...
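Once the full response is parsed, the named entities can be picked out of `items` by their `ne` field, which is empty for non-entity tokens. The sample dict below is hand-built to mirror the fields shown in the output above; the `LOC`/`PER` values follow the tag names the Baidu lexer uses for locations and persons (an assumption here, since the output above is truncated before any entity appears):

```python
# Hand-built sample mirroring the lexer response fields shown above
sample = {
    "items": [
        {"item": "世界杯", "pos": "nz", "ne": ""},
        {"item": "墨西哥", "pos": "",   "ne": "LOC"},
        {"item": "洛萨诺", "pos": "",   "ne": "PER"},
    ]
}

# Keep only the tokens the lexer marked as named entities
entities = [(it["item"], it["ne"]) for it in sample["items"] if it["ne"]]
print(entities)  # [('墨西哥', 'LOC'), ('洛萨诺', 'PER')]
```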
            