Calling Stanford CoreNLP from Python to run named entity recognition on pre-segmented sentences

vv,vv

Tags: python, natural language processing

Article link: https://blog.csdn.net/weixin_43344092/article/details/117342687

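import logging

from stanfordcorenlp import StanfordCoreNLP

# Start the wrapper against a local CoreNLP 4.2.0 directory, using the Chinese models
nlp = StanfordCoreNLP(r'./stanford-corenlp-4.2.0', lang='zh', logging_level=logging.DEBUG)

text = "..."  # a str holding the pre-segmented sentence, with words separated by spaces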
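If the segmenter returns a list of words rather than a string, the space-separated input can be built as follows. This is a minimal sketch; the helper name join_tokens is hypothetical and not part of stanfordcorenlp.

```python
def join_tokens(tokens):
    """Join pre-segmented words into one space-separated string.

    Hypothetical helper: empty tokens are dropped, and a token that itself
    contains whitespace is rejected, because a whitespace tokenizer would
    split it into several tokens.
    """
    cleaned = [t.strip() for t in tokens if t.strip()]
    for t in cleaned:
        if any(ch.isspace() for ch in t):
            raise ValueError("token contains whitespace: %r" % t)
    return " ".join(cleaned)
```

The resulting string can then be used as text in either method.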

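Method 1:

Call nlp.ner:
ner_result = nlp.ner(text)

For this to keep the existing segmentation, edit the _request() function in the package's corenlp.py file. At line 232 (in the version used here), add one more property, 'tokenize.language': 'Whitespace', so that
properties = {'annotators': annotators, 'outputFormat': 'json'}
becomes
properties = {'annotators': annotators, 'outputFormat': 'json', 'tokenize.language': 'Whitespace'}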

Method 2:

Call nlp.annotate:
ner_result = nlp.annotate(text, properties={
    'annotators': 'ner',
    'tokenize.language': 'Whitespace',
    'pipelineLanguage': 'zh',  # this property must be included for it to work on Chinese
    'outputFormat': 'json'
})
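With 'outputFormat': 'json', the result of the annotate call is CoreNLP's JSON document rather than a list of pairs. The sketch below turns it into (word, tag) tuples. It assumes the usual CoreNLP JSON layout (a "sentences" list whose items each hold a "tokens" list with "word" and "ner" fields) and accepts either the raw JSON string or an already parsed dict. The function name and the sample response are made up for illustration.

```python
import json


def extract_ner_pairs(response):
    """Return [(word, ner_tag), ...] from a CoreNLP JSON response.

    Assumption: the response follows the usual CoreNLP JSON layout,
    {"sentences": [{"tokens": [{"word": ..., "ner": ...}, ...]}, ...]}.
    """
    doc = json.loads(response) if isinstance(response, str) else response
    pairs = []
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            # Tokens without an "ner" field are treated as outside any entity.
            pairs.append((token["word"], token.get("ner", "O")))
    return pairs


# Hypothetical, hand-written response used only to exercise the function.
sample = (
    '{"sentences": [{"index": 0, "tokens": ['
    '{"index": 1, "word": "Beijing", "ner": "CITY"}, '
    '{"index": 2, "word": "is", "ner": "O"}]}]}'
)
pairs = extract_ner_pairs(sample)
```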

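PS:
1. Both methods work on the same principle. Comparing the code of annotate and ner shows that both end up calling r = requests.post(self.url, params=params, data=data, headers={'Connection': 'close'}); the only difference is what goes into params. In both cases the fix is the added 'tokenize.language': 'Whitespace' property. Method 1 does not need 'pipelineLanguage': 'zh' passed explicitly because nlp.ner() calls the internal _request() method, and the params built inside _request() already include it.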
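To make the comparison concrete, the sketch below builds the two parameter sets side by side. It only mirrors the description above: the function names are hypothetical, and the step where the wrapper serializes the properties before sending them is left out.

```python
WHITESPACE = {'tokenize.language': 'Whitespace'}


def method1_params(lang, annotators='ner'):
    """Params as the patched _request() would build them (sketch).

    The language travels as a separate 'pipelineLanguage' entry that
    _request() adds itself, so the caller never passes it.
    """
    properties = {'annotators': annotators, 'outputFormat': 'json'}
    properties.update(WHITESPACE)
    return {'properties': properties, 'pipelineLanguage': lang}


def method2_params(lang, annotators='ner'):
    """Params for annotate(): everything is supplied by the caller (sketch)."""
    properties = {
        'annotators': annotators,
        'pipelineLanguage': lang,
        'outputFormat': 'json',
    }
    properties.update(WHITESPACE)
    return {'properties': properties}


m1 = method1_params('zh')
m2 = method2_params('zh')
```

Either way the server receives the same three pieces of information: the annotator list, the pipeline language, and the whitespace tokenizer setting.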

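2. If a sentence contains a '%' character, it cannot be annotated for named entities; the call fails with the error 'Could not handle incoming annotation'. This is also mentioned in the stanfordcorenlp documentation: https://www.bountysource.com/teams/stanfordnlp/issues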
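One workaround that is often suggested for this error is to escape the percent sign before sending the text. This is an assumption, not something tested in this post: it presumes the server URL-decodes the request body, so a bare '%' is read as the start of an escape sequence while '%25' is decoded back to '%'. The helper name is hypothetical.

```python
def escape_percent(text):
    """Replace every '%' with its URL-encoded form '%25'.

    Assumption: the CoreNLP server URL-decodes the request body, so the
    escaped text is turned back into the original before tokenization.
    """
    return text.replace('%', '%25')


escaped = escape_percent('up 5 % today')
```

If this does not help in your setup, stripping or replacing the '%' token before annotation avoids the error at the cost of changing the input.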

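3. When you are done, be sure to call nlp.close(); otherwise the CoreNLP server process started by the wrapper keeps running.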
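A simple way to make sure close() runs even when annotation raises (for example on the '%' error above) is a try/finally block. The sketch below uses a hypothetical wrapper function and a stand-in class so that it runs without a CoreNLP install; with the real library, pass the StanfordCoreNLP instance instead.

```python
def run_ner_and_close(nlp, text):
    """Run nlp.ner(text) and always call nlp.close() afterwards."""
    try:
        return nlp.ner(text)
    finally:
        # Runs on success and on error alike.
        nlp.close()


class FakeNLP:
    """Stand-in for StanfordCoreNLP, only for demonstrating the pattern."""

    def __init__(self):
        self.closed = False

    def ner(self, text):
        if '%' in text:
            raise ValueError('Could not handle incoming annotation')
        return [(word, 'O') for word in text.split(' ')]

    def close(self):
        self.closed = True


fake = FakeNLP()
result = run_ner_and_close(fake, 'a b')
```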

Reference: https://stackoverflow.com/questions/45299170/stanford-corenlp-tokenize-whitespace-property-not-working-on-chinese