Chinese Named Entity Recognition (NER) with Stanford CoreNLP

秦岭熊猫

Article link: https://blog.csdn.net/tianshan2010/article/details/104012528

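Stanford CoreNLP is a capable natural language processing toolkit. Some of its models, such as the neural dependency parser, are trained with deep learning methods, while others, such as the CRF-based NER model configured below, use classical statistical learning.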

Official links first:

https://stanfordnlp.github.io/CoreNLP/index.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/
https://github.com/stanfordnlp/CoreNLP

This article explains how to use Stanford CoreNLP in a Java project.

1. Environment setup

Versions after 3.5 require Java 8 or later. Chinese processing is memory-hungry and consumes roughly 3 GB of memory.
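
Since the pipeline needs about 3 GB for Chinese, it can help to check the JVM's maximum heap before loading any models. The sketch below is only an illustration: the class name `HeapCheck`, the helper `hasEnoughHeap`, and the 3 GB threshold (taken from the estimate above) are assumptions, not part of CoreNLP.

```java
final class HeapCheck {
    // About 3 GB, the memory figure quoted above for Chinese processing (an estimate).
    static final long REQUIRED_BYTES = 3L * 1024 * 1024 * 1024;

    // Returns true when the given maximum heap size covers the estimate.
    static boolean hasEnoughHeap(long maxHeapBytes) {
        return maxHeapBytes >= REQUIRED_BYTES;
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will try to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxHeap / (1024 * 1024));
        if (!hasEnoughHeap(maxHeap)) {
            System.out.println("Heap may be too small; consider a larger -Xmx value, e.g. -Xmx4g");
        }
    }
}
```

If the check fails, starting the JVM with a larger `-Xmx` value is the usual remedy.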

I use Maven to pull in the dependencies, at version 3.9.2.

Add the following dependencies directly to the pom file:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
    <classifier>models</classifier>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
    <classifier>models-chinese</classifier>
</dependency>

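These three artifacts are the CoreNLP code, the English models, and the Chinese models. Together they total about 1.43 GB. Maven's default repository is hosted overseas and these artifacts are very large, so it is worth trying a domestic mirror that carries all three. I used my company's own Maven repository.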

2. Calling the code

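Note that because I need named entity recognition for Chinese, the pipeline has to use the Chinese word segmenter and the Chinese dictionaries.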

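The Chinese models package includes a file named StanfordCoreNLP-chinese.properties, which sets the parameters for Chinese processing: mainly the pipeline steps to run and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models, so you can write a custom configuration file and load that instead. In my project I simply read this properties file as is.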
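
Reading the properties file can be sketched with plain `java.util.Properties`. This is a minimal sketch, not the author's code: the class `ChinesePropsLoader` and its methods are hypothetical names, and it assumes the file is on the classpath (it normally ships in the models-chinese jar). The file is read as UTF-8 because it contains non-ASCII characters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ChinesePropsLoader {
    // File name mentioned above; assumed to be available on the classpath.
    static final String RESOURCE = "StanfordCoreNLP-chinese.properties";

    // Parses key/value pairs from any character source.
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Looks the resource up on the classpath and decodes it as UTF-8.
    static Properties loadFromClasspath(String resource) {
        InputStream in = ChinesePropsLoader.class.getClassLoader().getResourceAsStream(resource);
        if (in == null) {
            throw new IllegalStateException("Resource not found on classpath: " + resource);
        }
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties props = loadFromClasspath(RESOURCE);
        System.out.println("annotators = " + props.getProperty("annotators"));
        // With CoreNLP on the classpath, the pipeline would then be built as:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

A custom configuration file can be loaded the same way by passing its resource name to `loadFromClasspath`.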

Note: I only need the NER function here and would like to drop the other annotators. However, Stanford CoreNLP has a limitation: tokenize, ssplit, pos, and lemma must all run before ner, which adds considerable processing time.
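
One way to drop the annotators that NER does not need is to override the `annotators` key before building the pipeline. The following is a sketch under assumptions: `NerOnlyProps` and `nerOnly` are hypothetical names, and the base properties are built by hand here as a stand-in for the values in the Chinese properties file.

```java
import java.util.Properties;

final class NerOnlyProps {
    // The annotators the text above says NER depends on, plus ner itself.
    static final String NER_ANNOTATORS = "tokenize, ssplit, pos, lemma, ner";

    // Returns a copy of base with the annotators list cut down; base is left untouched.
    static Properties nerOnly(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty("annotators", NER_ANNOTATORS);
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for the default Chinese configuration (hypothetical subset).
        Properties base = new Properties();
        base.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, coref");
        base.setProperty("ner.useSUTime", "false");

        Properties trimmed = nerOnly(base);
        System.out.println(trimmed.getProperty("annotators"));
        // With CoreNLP on the classpath: new StanfordCoreNLP(trimmed)
    }
}
```

Leaving out parse and coref should avoid loading their models, which are not needed for NER.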

Let's first take a look at this properties file:

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?！？]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner: sets the language and model (CRF) used for NER. SUTime currently supports only English, not Chinese, so it is set to false.
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
de
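
To see what the `ssplit.boundaryTokenRegex` value in the file above accepts, the pattern can be tried against individual tokens with `java.util.regex`. This is an illustrative sketch only: `BoundaryRegexDemo` is a hypothetical class, the sample tokens are made up, and it assumes the regex is matched against a whole token rather than searched for inside it.

```java
import java.util.regex.Pattern;

final class BoundaryRegexDemo {
    // Value of ssplit.boundaryTokenRegex copied from the properties file above.
    static final Pattern BOUNDARY = Pattern.compile("[.。]|[!?！？]+");

    // Assumption: the whole token must match, so matches() is used instead of find().
    static boolean isBoundaryToken(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        // Made-up segmented tokens for illustration.
        String[] tokens = {"今天", "天气", "很", "好", "。", "真的", "吗", "？！"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isBoundaryToken(token));
        }
    }
}
```

The first alternative accepts a single ASCII or full-width period, while the second accepts a run of one or more exclamation or question marks in either width.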
