pyltp引入外部词典

最新推荐文章于 2021-05-07 16:53:29 发布

登高博见凌云志

最新推荐文章于 2021-05-07 16:53:29 发布

阅读量2.8k

点赞数

分类专栏：自然语言处理文章标签： python

本文链接：https://blog.csdn.net/sinat_40631989/article/details/105106349

版权

自然语言处理专栏收录该内容

9 篇文章 1 订阅

订阅专栏

大家好，今天跟大家介绍一下在文本学习过程中，为什么要引入外部词典以及引入外部词典之后又什么变化。

为什么引入外部词典
怎么引入（外部词典的配置）
一、为什么引入？
pyltp分词支持用户使用自定义词典，分词外部词典本身是一个文本文件（*.txt）。每行指定一个词，编码必须为UTF-8。（保存文件的时候，设置编码为UTF-8）。

代码注意以下几点：
1、改变模型文件路径！
2、外部词典的加载路径代码。（如下图）

完整代码如下：

# -*- coding: utf-8 -*-
import os
from pyltp import Segmentor, Postagger
# 分词
LTP_DATA_DIR = 'E:\Python\pyltp\ltp\ltp\ltp_data'  # ltp模型目录的路径
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')  # 分词模型路径，模型名称为`cws.model`
lexicon_path = os.path.join(LTP_DATA_DIR, 'E:\Python\pyltp\ltp\ltp\ltp_data\lexicon.txt')  # 参数lexicon是自定义词典的文件路径
segmentor = Segmentor()
segmentor.load_with_lexicon(cws_model_path, lexicon_path)
sent = '据韩联社12月28日反映，美国防部发言人杰夫·莫莱尔27日表示，美国防部长盖茨将于2011年1月14日访问韩国。2010年2月28日中国刘军报道'
words = segmentor.segment(sent)  # 分词
# 词性标注
pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')  # 词性标注模型路径，模型名称为`pos.model`
postagger = Postagger()  # 初始化实例
postagger.load(pos_model_path)  # 加载模型
postags = postagger.postag(words)  # 词性标注
ner_model_path = os.path.join(LTP_DATA_DIR, 'ner.model')   # 命名实体识别模型路径，模型名称为`pos.model`
from pyltp import NamedEntityRecognizer
recognizer = NamedEntityRecognizer() # 初始化实例
recognizer.load(ner_model_path)  # 加载模型
# netags = recognizer.recognize(words, postags)  # 命名实体识别
# 提取识别结果中的人名，地名，组织机构名
persons, places, orgs = set(), set(), set()
netags = list(recognizer.recognize(words, postags))  # 命名实体识别
print(netags)
# print(netags)
i = 0
for tag, word in zip(netags, words):
    j = i
    # 人名
    if 'Nh' in tag:
        if str(tag).startswith('S'):
            persons.add(word)
        elif str(tag).startswith('B'):
            union_person = word
            while netags[j] != 'E-Nh':
                j += 1
                if j < len(words):
                    union_person += words[j]
            persons.add(union_person)
    # 地名
    if 'Ns' in tag:
        if str(tag).startswith('S'):
            places.add(word)
        elif str(tag).startswith('B'):
            union_place = word
            while netags[j] != 'E-Ns':
                j += 1
                if j < len(words):
                    union_place += words[j]
            places.add(union_place)
    # 机构名
    if 'Ni' in tag:
        if str(tag).startswith('S'):
            orgs.add(word)
        elif str(tag).startswith('B'):
            union_org = word
            while netags[j] != 'E-Ni':
                j += 1
                if j < len(words):
                    union_org += words[j]
            orgs.add(union_org)
    i += 1
print('人名：', '，'.join(persons))
print('地名：', '，'.join(places))
print('组织机构：', '，'.join(orgs))
# 释放模型
segmentor.release()
postagger.release()
recognizer.release()