Named Entity Recognition with Python Natural Language Processing

Building and evaluating an NER system:

1. Split the document into sentences

2. Split each sentence into words

3. Tag each word with its part of speech

4. Identify named entities among the tagged words

5. Classify each named entity

6. Evaluate the results
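The last step, evaluation, is not covered by the code below, so here is a minimal sketch of entity-level precision/recall/F1. The `entity_prf` helper and the sample gold/predicted pairs are illustrative, not part of NLTK:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 over (text, label) pairs."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # entities predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Mexico", "GPE"), ("Losano", "PERSON"), ("World Cup", "EVENT")]
pred = [("Mexico", "GPE"), ("Losano", "PERSON"), ("German", "GPE")]
print(entity_prf(gold, pred))  # precision, recall and F1 are each 2/3 here
```

Real evaluations (e.g. CoNLL scoring) additionally distinguish exact-span matches from partial overlaps, but the set intersection above captures the core idea.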

NLTK (Natural Language Toolkit) is one of the most widely used Python libraries in NLP. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Official documentation: http://www.nltk.org
Installing and testing NLTK with Python: https://blog.csdn.net/shuihupo/article/details/79635044
Calling the Baidu NLP API: https://blog.csdn.net/m0_37788308/article/details/79994499

Named entity recognition with NLTK:

Read in the English text data

# -*- coding: utf-8 -*- 
import nltk  
import pprint
# filename = "test.txt"
# with open(filename, 'r', encoding='utf8') as sampleFile:
#     text=sampleFile.read()

To avoid file-path problems when the code is moved, we store the text in a variable instead:

text = "Mexico quakes with joy over World Cup upset win.Mexico’s Earthquake Early Warning and Monitoring System issued a message on the 17th that the Mexican team played against the German team in the World Cup. During the first half of the game until the 35th minute, the Mexican team striker Losano broke the deadlock and scored the first goal, scoring a goal in Mexico. The city monitored minor earthquakes. This monitoring system analyzes that the earthquake was caused by man-made methods or caused by many people excitedly jumping when scoring."
en = {} 

Steps 1-3: split the document into sentences, split the sentences into words, and tag each word with its part of speech.

tokenized = nltk.word_tokenize(text)     # tokenization
# pprint.pprint(tokenized)
tagged = nltk.pos_tag(tokenized)         # part-of-speech tagging
# pprint.pprint(tagged)
chunked = nltk.ne_chunk(tagged)          # named entity recognition

NN — noun, singular or mass: year, home, costs, time, education

NNS — noun, plural: undergraduates, scotches

NNP — proper noun, singular: Alison, Africa, April, Washington

NNPS — proper noun, plural: Americans, Americas, Amharas, Amityvilles
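These tags come from the Penn Treebank tagset that `nltk.pos_tag` uses by default. The lookup table and helper below are hand-written for illustration (not an NLTK API); the `NNP*` check matters because proper nouns are the main candidates for named entities:

```python
# Hand-written subset of the Penn Treebank noun tags (for illustration only)
NOUN_TAGS = {
    "NN":   "noun, singular or mass",
    "NNS":  "noun, plural",
    "NNP":  "proper noun, singular",
    "NNPS": "proper noun, plural",
}

def is_proper_noun(tag):
    # Proper-noun tags start with 'NNP'; these are the usual NER candidates
    return tag.startswith("NNP")

print(is_proper_noun("NNP"), is_proper_noun("NN"))  # True False
```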

# pprint.pprint(chunked)  # chunked is of type <class 'nltk.tree.Tree'>
chunked.draw()  # opens a window rendering the tree; draw() returns None, so wrapping it in print() just prints None
Take care when processing the tree; observe its structure first.
for tree in chunked:
    # print(tree)
    # print(type(tree))  # non-entity tokens are plain tuples; named entities are Tree objects
    if hasattr(tree, 'label'):
        # tree.draw()
        ne = ' '.join(c[0] for c in tree.leaves())
        en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
for key in en.keys():
    print(key, ':', en[key])
Mexican : ['GPE', 'NNP']
Early Warning : ['PERSON', 'JJ NNP']
German : ['GPE', 'JJ']
Monitoring System : ['ORGANIZATION', 'NNP NNP']
Mexico : ['GPE', 'NNP']
Losano : ['PERSON', 'NNP']

Implementation with the Baidu API

textnew= "世界杯爆冷门,墨西哥球迷激动跳跃引发首都墨西哥城地震!墨西哥地震预警监控系统17日发布消息,当天墨西哥队在对阵德国队的世界杯比赛中,上半场比赛进行至第35分钟时,墨西哥队前锋洛萨诺打破僵局攻入首球,进球时墨西哥城监测到轻微地震。这一监控系统分析说,这次地震是由人为方式引发,或因进球时许多民众激动跳跃造成。"
# -*- coding: utf-8 -*-
import urllib3
import json
import pprint

Step 1: obtain an access_token.
client_id is the API Key (AK) from the Baidu console; client_secret is the Secret Key (SK).

access_token ="24.340837cdde292a61442507b60e6fb64c.2592000.1532060546.282335-11012308"

Step 2: call the API with a POST request, passing in the parameters.

import sys
print(sys.getdefaultencoding())
http=urllib3.PoolManager()
url = "https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer?access_token="+access_token
print(url)
data ={
  "text":textnew}
utf-8
https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer?access_token=24.340837cdde292a61442507b60e6fb64c.2592000.1532060546.282335-11012308
encode_data = json.dumps(data).encode('GBK')  # the payload is a dict; it must be serialized and encoded before sending
# JSON: pre-serialized JSON data can be sent by setting the body parameter and the Content-Type header:
request = http.request('POST',
                       url,
                       body=encode_data,
                       headers={
  'Content-Type':'application/json'}
                       )
result = str(request.data,'GBK')
D:\local\Anaconda3\lib\site-packages\urllib3\connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
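A side note on the `'GBK'` encoding used above: with the default `ensure_ascii=True`, `json.dumps` escapes Chinese characters to `\uXXXX`, so the encoded bytes are ASCII-safe either way. The standalone round trip below (independent of the API call) shows that the text survives serialization:

```python
import json

payload = {"text": "世界杯爆冷门"}
body = json.dumps(payload).encode("GBK")   # \uXXXX escapes make the bytes ASCII-safe
restored = json.loads(body.decode("GBK"))  # decode and parse the round trip
print(restored["text"] == payload["text"])  # True
```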
result_dict = json.loads(result)  # parse the JSON response; json.loads is safer than eval
pprint.pprint(result_dict)
{'items': [{'basic_words': ['世界', '杯'],
            'byte_length': 6,
            'byte_offset': 0,
            'formal': '',
            'item': '世界杯',
            'loc_details': [],
            'ne': '',
            'pos': 'nz',
            'uri': ''},
           {'basic_words': ['爆冷', '门'],
            'byte_length': 6,
            'byte_offset': 6,
            'formal': '',
            'item': '爆冷门',
            'loc_details': [],
            'ne': '',
            'pos': 'nz',
            'uri': ''},
           {'basic_words': [','],
            'byte_length': 2,
            'byte_offset': 12,
            'formal': '',
            'item': ',',
            'loc_details': [],
            'ne': '',
            'pos': 'w',
            'uri': ''},
           {'basic_words': ['墨西哥'],
            'byte_length': 6,
            ...
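Once the full response is parsed, the named entities can be picked out of `items` by their `ne` field, which is empty for non-entity tokens. The sample dict below is hand-built to mirror the fields shown in the output above; the `LOC`/`PER` values follow the tag names the Baidu lexer uses for locations and persons (an assumption here, since the output above is truncated before any entity appears):

```python
# Hand-built sample mirroring the lexer response fields shown above
sample = {
    "items": [
        {"item": "世界杯", "pos": "nz", "ne": ""},
        {"item": "墨西哥", "pos": "",   "ne": "LOC"},
        {"item": "洛萨诺", "pos": "",   "ne": "PER"},
    ]
}

# Keep only the tokens the lexer marked as named entities
entities = [(it["item"], it["ne"]) for it in sample["items"] if it["ne"]]
print(entities)  # [('墨西哥', 'LOC'), ('洛萨诺', 'PER')]
```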
            