Word segmentation, part-of-speech tagging, and NER with pyltp

gbbb1234

Article link: https://blog.csdn.net/gbbb1234/article/details/72676917

Environment: Windows 10, Python 2.7

This mainly follows the official documentation:

http://pyltp.readthedocs.io/zh_CN/latest/api.html#

http://ltp.readthedocs.io/zh_CN/latest/install.html

1. Install pyltp

https://github.com/hit-scir/pyltp

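Don't forget to download the models linked from that page; they are updated from time to time.

After downloading the source, unzip it, change to the extracted directory in cmd, and install with python setup.py install. If import pyltp runs in Python without an error, the installation succeeded.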
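
The import check can also be scripted. The following is only a minimal sketch: the helper name is_importable is made up for illustration and is not part of pyltp.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == '__main__':
    # 'pyltp' is the package installed above; False means the install
    # step did not complete.
    print('pyltp importable: %s' % is_importable('pyltp'))
```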

2. Install CMake

https://cmake.org/download/

From the link, download the .msi installer that matches your machine, run it, and follow the prompts.

3. Get Visual Studio

Most people probably already have it installed.

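4. Build

Create a directory named build in the project folder, change to it in cmd, and run: cmake ..

This generates three VC projects: ALL_BUILD, RUN_TESTS, and ZERO_CHECK.

Open the ALL_BUILD project in VS and select Release under Build > Configuration Manager.

Right-click the project and choose Build.

The otcws, otpos, and other tools then appear in the tools/train/Release directory.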
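
To confirm that the build produced the tools, something like the sketch below could be run from the project folder. The helper name missing_tools and the file names otcws.exe and otpos.exe are assumptions for illustration; adjust them to whatever the build actually produces.

```python
import os


def missing_tools(release_dir, tool_names):
    """Return the tool file names that are not present in release_dir."""
    return [name for name in tool_names
            if not os.path.isfile(os.path.join(release_dir, name))]


if __name__ == '__main__':
    # Assumed file names: the Windows executables of the otcws and otpos tools.
    expected = ['otcws.exe', 'otpos.exe']
    print(missing_tools(os.path.join('tools', 'train', 'Release'), expected))
```

An empty list means every expected tool was found.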

5. Word segmentation

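from pyltp import Segmentor
def segmentor(sentence):
    seg = Segmentor()  # named seg so it does not shadow this function's name
    seg.load('cws.model')  # load the segmentation model
    words = seg.segment(sentence)  # segment the sentence
    word_list = list(words)  # copy the result before the model is released
    seg.release()  # release the model
    return word_list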

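Customized segmentation

Customized segmentation is a distinctive feature of LTP. It addresses the case where the test data moves from the news domain to a different one, such as fiction or finance. When switching to a new domain, the user only needs to annotate a small amount of data; customized segmentation then trains incrementally on top of the original news data. This makes use of the rich news-domain data while still accounting for the particulars of the target domain.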

In cmd, change to the tools/train/Release directory

and run:

otcws.exe customized-learn --baseline-model path/to/your/model --model name.model --reference path/to/the/reference/file --development path/to/the/development/file

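Then wait for training to finish.

The parameters are:

reference: the training set file

development: the development set file

algorithm: the parameter learning method. LTP's online learning framework currently supports two: passive aggressive (pa) and averaged perceptron (ap).

model: the prefix of the output model file name; models are named model.$iter

max-iter: the maximum number of iterations

rare-feature-threshold: how aggressively the model is pruned. If rare-feature-threshold is 0, only features whose value is 0 are removed; if it is greater than 0, features updated fewer times than the threshold are removed as well. For details of the pruning algorithm, see the model pruning section of the LTP documentation.

dump-details: save all model information when the model is written. This parameter is used for customized segmentation; see the customized segmentation section of the LTP documentation.

Note that both reference and development must contain manually segmented sentences.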
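
If training is started from a script instead of typing the command by hand, the arguments can be assembled as a list. This is a rough sketch only: the function name build_customized_learn_command is hypothetical, the paths are the same placeholders as in the command above, and the flag is assumed to be --baseline-model followed by the path.

```python
def build_customized_learn_command(baseline_model, model_prefix,
                                   reference, development,
                                   executable='otcws.exe'):
    """Assemble the otcws customized-learn command as an argument list."""
    return [
        executable, 'customized-learn',
        '--baseline-model', baseline_model,  # assumed flag spelling
        '--model', model_prefix,
        '--reference', reference,
        '--development', development,
    ]


if __name__ == '__main__':
    cmd = build_customized_learn_command(
        'path/to/your/model', 'name.model',
        'path/to/the/reference/file', 'path/to/the/development/file')
    print(' '.join(cmd))
    # To actually start training from the tools/train/Release directory:
    # import subprocess
    # subprocess.check_call(cmd)
```

Passing a list to subprocess avoids quoting problems when a path contains spaces.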

6. Part-of-speech tagging

from pyltp import Postagger
def posttagger(words):
    postagger = Postagger()
    postagger.load('pos.model')  # load the POS model
    posttags = postagger.postag(words)  # tag each word
    postags = list(posttags)  # copy the result before the model is released
    postagger.release()  # release the model
    return postags

7. NER

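from pyltp import NamedEntityRecognizer
def ner(words, postags):
    recognizer = NamedEntityRecognizer()
    recognizer.load('ner.model')  # load the model
    netags = recognizer.recognize(words, postags)  # named entity recognition
    nertags = list(netags)  # copy the tags before the model is released
    recognizer.release()  # release the model
    for word, ntag in zip(words, nertags):
        print word + '/' + ntag
    return nertags  # step 9 relies on this return value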
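
Each tag returned by recognize is either O (not an entity) or a position marker (S for a single-word entity, B, I, E for the beginning, inside, and end of a multi-word one) joined by a hyphen to an entity type (Nh for person, Ni for organization, Ns for place). As a small illustration, a tag can be taken apart like this; the helper name split_ner_tag is hypothetical, not a pyltp function.

```python
def split_ner_tag(tag):
    """Split an LTP NER tag such as 'B-Ni' into (position, entity_type)."""
    if tag == 'O':
        return 'O', None  # not part of any entity
    position, entity_type = tag.split('-', 1)
    return position, entity_type
```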

8. Read the text

import codecs
# The file on disk is in the encoding given by the encoding argument; codecs.open decodes it to unicode.
news_files = codecs.open('C:test.txt', 'r', encoding='utf8')
news_list = news_files.readlines()
news_files.close()
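
Note that readlines() keeps the trailing newline on every line and also returns blank lines, so both end up in the later steps. One possible way to strip them while reading is sketched below; the function name read_sentences is made up for this example, and io.open is used because it behaves the same on Python 2.7 and 3.

```python
import io


def read_sentences(path, encoding='utf8'):
    """Read a text file and return its non-empty lines as unicode strings."""
    with io.open(path, 'r', encoding=encoding) as handle:
        # strip() removes the newline and surrounding spaces;
        # lines that are empty afterwards are skipped
        return [line.strip() for line in handle if line.strip()]
```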




9. Save the results

# Create a new txt file to hold the NER results
out_file = codecs.open('ner.txt', 'w', encoding='utf8')

for row in news_list:
    news_str = row.encode("utf-8")  # the segmenter expects a byte str, not unicode
    words = segmentor(news_str)
    tags = posttagger(words)
    nertags = ner(words, tags)
    for word, nertag in zip(words, nertags):
        out_file.write(word.decode('utf-8') + '/' + nertag.decode('utf-8') + ' ')

out_file.close()
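
The loop above writes every sentence onto one long line. If one sentence per line is preferred, the formatting could be pulled into a small helper like the sketch below. The name format_tagged_line is hypothetical, and the words and tags are assumed to be already decoded to unicode. Splitting the file on whitespace, as step 10 does, still works with this layout.

```python
def format_tagged_line(words, tags):
    """Join words and tags as 'word/tag' pairs on a single line."""
    pairs = [word + u'/' + tag for word, tag in zip(words, tags)]
    return u' '.join(pairs) + u'\n'
```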



10. Extract the entities
import codecs
import re

# Read the tagged text written in step 9
in_file = codecs.open('ner.txt', 'r', encoding='utf8')
file_content = in_file.read()
in_file.close()
file_list = file_content.split()
#print file_list

out_file = codecs.open('tiqu.txt', 'w', encoding='utf8')

ner_list = []
phrase_list = []
for word in file_list:
    if re.search('Ni$', word):  # $ anchors the match at the end; Ni marks an organization name
        print word
        out_file.write(word + ' ')
        word_list = word.rsplit('/', 1)
        # Decide whether the word is an entity by itself or part of a longer one
        if re.search(r'^S', word_list[1]):
            ner_list.append(word_list[0])
        elif re.search(r'^[BI]', word_list[1]):
            phrase_list.append(word_list[0])
        else:
            # E tag: the entity is complete, so join its words into one string
            phrase_list.append(word_list[0])
            ner_list.append(''.join(phrase_list))
            phrase_list = []
#for ner in ner_list:
    #print ner

out_file.close()
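
The same grouping idea works for the other entity types (Nh for person names, Ns for place names). Below is a rough sketch of it as a reusable function that takes the list of word/tag tokens; the name extract_entities and its entity_type parameter are hypothetical.

```python
def extract_entities(tokens, entity_type='Ni'):
    """Collect entities of one type from 'word/tag' tokens.

    Single-word entities (S-) are returned as is; multi-word entities are
    assembled from their B-, I- and E- parts.
    """
    entities = []
    phrase = []
    for token in tokens:
        word, _, tag = token.rpartition('/')
        if not tag.endswith('-' + entity_type):
            phrase = []  # any other token interrupts an unfinished phrase
            continue
        position = tag[0]
        if position == 'S':
            entities.append(word)
        elif position == 'B':
            phrase = [word]
        elif position == 'I':
            phrase.append(word)
        elif position == 'E':
            phrase.append(word)
            entities.append(''.join(phrase))
            phrase = []
    return entities
```

For example, the tokens 中国/B-Ni 人民/I-Ni 银行/E-Ni are merged into the single entity 中国人民银行.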