Chinese Named Entity Recognition with Stanford Word Segmenter and Stanford Named Entity Recognizer (NER)

shijiebei2009

Article link: https://blog.csdn.net/shijiebei2009/article/details/42525091

1. Word Segmenter Overview

http://nlp.stanford.edu/software/segmenter.shtml

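This is Stanford University's word segmenter. It requires JDK 1.8 or later. Download stanford-segmenter-2014-10-26 from the link above and unpack it.

The data directory contains two gz-compressed files, ctb.gz and pku.gz. CTB: training data from the University of Pennsylvania's Chinese Treebank. PKU: training data provided by Peking University. You can also train your own model; a training example is available at http://nlp.stanford.edu/software/trainSegmenter-20080521.tar.gz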

2. NER Overview

http://nlp.stanford.edu/software/CRF-NER.shtml

Stanford NER is implemented in Java and can recognize PERSON, ORGANIZATION, and LOCATION entities. Research published using this software should cite the following paper:

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

The paper can be downloaded from:

http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

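Two archives can be downloaded from the NER page: stanford-ner-2014-10-26 and stanford-ner-2012-11-11-chinese.

Unpack both archives. By default, NER handles English text; processing Chinese requires the separate Chinese models.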

Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class:	Location, Person, Organization
4 class:	Location, Person, Organization, Misc
7 class:	Time, Location, Organization, Person, Money, Percent, Date
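The statement that the 3-class model covers "the intersection of those class sets" can be checked directly. The following is a minimal sketch using only the Java standard library; the class name `NerClassSets` and its members are illustrative names, not part of Stanford NER.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NerClassSets {
    // Labels as listed in the Stanford NER description quoted above.
    static final List<String> FOUR_CLASS =
        Arrays.asList("Location", "Person", "Organization", "Misc");
    static final List<String> SEVEN_CLASS =
        Arrays.asList("Time", "Location", "Organization", "Person", "Money", "Percent", "Date");

    /** Returns the labels present in both lists, in the order of the first list. */
    static Set<String> intersection(List<String> a, List<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // prints [Location, Person, Organization]
        System.out.println(intersection(FOUR_CLASS, SEVEN_CLASS));
    }
}
```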

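As listed above, 3-class, 4-class, and 7-class models are provided for English, but none of these is available for Chinese. An online demo is available at http://nlp.stanford.edu:8080/ner/ if you want to try it out.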

3. Using the Segmenter and NER Together

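Create a new Java Project in Eclipse and set it up as follows. Copy the segmenter's data directory to the project root and add stanford-segmenter-3.5.0 to the classpath. Copy everything unpacked from stanford-ner-2012-11-11-chinese into a folder named classifiers, place that folder in the project root, and add stanford-ner-3.5.0.jar and stanford-ner.jar to the classpath. Finally, download stanford-corenlp-full-2014-10-31 from http://nlp.stanford.edu/software/corenlp.shtml and add the unpacked stanford-corenlp-3.5.0 to the classpath as well. The finished project has the data and classifiers folders at its root and the jars above on its classpath.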
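Before running the demo, it can help to confirm that the model files sit where the code expects them. The sketch below is only a convenience check built on the standard library. The class name `ProjectLayoutCheck` is hypothetical, and the three relative paths are the ones the demo code in this article loads.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ProjectLayoutCheck {
    // Relative paths loaded by the demo code in this article.
    static final String[] REQUIRED = {
        "data/ctb.gz",
        "data/dict-chris6.ser.gz",
        "classifiers/chinese.misc.distsim.crf.ser.gz"
    };

    /** Returns the required paths that do not exist under projectRoot. */
    static List<String> missingFiles(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String rel : REQUIRED) {
            if (!Files.exists(projectRoot.resolve(rel))) {
                missing.add(rel);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to the current directory, which is the project root when run from Eclipse.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(root);
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

Run it from the project root. Any path it reports as missing points to a copy step above that was skipped.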

The Stanford NER page gives the following note on Chinese NER:

We also provide Chinese models built from the Ontonotes Chinese named entity data. There are two models, one using distributional similarity clusters and one without. These are designed to be run on word-segmented Chinese. So, if you want to use these on normal Chinese text, you will first need to run Stanford Word Segmenter or some other Chinese word segmenter, and then run NER on the output of that!

The note is clear: the output of Chinese word segmentation must be used as the input to NER before named entities can be recognized.
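The data flow described here (raw text, then segmented tokens, then NER) can be sketched without the Stanford jars. In the example below, `SegmentThenNer`, `fakeSegmenter`, and `fakeNer` are hypothetical stand-ins used purely to show the order of the steps; the real segmenter and classifier take their place in the demo code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class SegmentThenNer {
    /** Joins tokens with single spaces, the word-segmented form passed to NER in this article. */
    static String toNerInput(List<String> tokens) {
        return String.join(" ", tokens);
    }

    /** Runs segmentation first and feeds the space-separated result to NER. */
    static String run(String rawText,
                      Function<String, List<String>> segmenter,
                      Function<String, String> ner) {
        return ner.apply(toNerInput(segmenter.apply(rawText)));
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins so the sketch runs without any model files.
        Function<String, List<String>> fakeSegmenter =
            text -> Arrays.asList("告诉", "李强", "一声");
        Function<String, String> fakeNer =
            segmented -> segmented.replace("李强", "<PERSON>李强</PERSON>");

        // prints: 告诉 <PERSON>李强</PERSON> 一声
        System.out.println(run("告诉李强一声", fakeSegmenter, fakeNer));
    }
}
```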

To make testing easier, this demo uses junit-4.10.jar. The code follows.

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * ClassName ExtractDemo
 * </p>
 * <p>
 * Description: loads the NER module
 * </p>
 *
 * @author wangxu wangx89@126.com
 *         <p>
 *         Date 2015-01-08 14:53:45
 *         </p>
 * @version V1.0.0
 */
public class ExtractDemo {
    // Static so the model is loaded only once, however many instances are created.
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public ExtractDemo() {
        InitNer();
    }

    public void InitNer() {
        // Path is relative to the project root.
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    /** Tags an already word-segmented sentence and returns it with inline XML entity tags. */
    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }

    public static void main(String args[]) {
        // The input must already be segmented into space-separated words.
        String str = "我 去 吃饭 ， 告诉 李强 一声 。";
        ExtractDemo extractDemo = new ExtractDemo();
        System.out.println(extractDemo.doNer(str));
        System.out.println("Complete!");
    }

}

import java.io.File;
import java.io.IOException;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * ClassName ZH_SegDemo
 * </p>
 * <p>
 * Description: Chinese word segmentation with Stanford CoreNLP
 * </p>
 *
 * @author wangxu wangx89@126.com
 *         <p>
 *         Date 2015-01-08 13:56:54
 *         </p>
 * @version V1.0.0
 */
public class ZH_SegDemo {
    public static CRFClassifier<CoreLabel> segmenter;
    static {
        // Set the initialization parameters.
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        // Iterate over the returned list directly. Collection.toArray() is only
        // guaranteed to return Object[], so casting it to String[] can throw
        // ClassCastException.
        StringBuilder buf = new StringBuilder();
        for (String s : segmenter.segmentString(sent)) {
            buf.append(s).append(' ');
        }
        System.out.println("segmented res: " + buf.toString());
        return buf.toString();
    }

    public static void main(String[] args) {
        try {
            String readFileToString = FileUtils.readFileToString(new File("澳门141人食物中毒与进食“问题生蚝”有关.txt"));
            // Segment first, then pass the space-separated result to NER.
            String doSegment = doSegment(readFileToString);
            System.out.println(doSegment);

            ExtractDemo extractDemo = new ExtractDemo();
            System.out.println(extractDemo.doNer(doSegment));

            System.out.println("Complete!");
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Note that a JDK 1.8+ environment is required. The final output is as follows:

loading dictionaries from data/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200

done [23.2 sec].

serDictionary=data/dict-chris6.ser.gz

sighanCorporaDict=data

inputEncoding=UTF-8

sighanPostProcessing=true

INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false

INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb

Loading character dictionary file from data/dict/character_list

Loading affix dictionary from data/dict/in.ctb

segmented res: 2008年 9月 9日新华网 9月 8日信息：（记者张家伟）澳门特区政府卫生局疾病预防及控制中心 8 日表示，目前累计有 141 人在本地自助餐厅进食后出现食物中毒症状，其中大部分与进食 “ 问题生蚝 ” 有关。卫生局最早在 3 日公布说，有 14 名来自三个群体的港澳人士 8月 27日至 30日期间在澳门金沙酒店用餐后出现不适，患者陆续出现发热、呕吐和腹泻等类诺沃克样病毒感染的症状。初步调查显示， “ 上述情况可能和进食生蚝有关 ” 。

2008年 9月 9日新华网 9月 8日信息：（记者张家伟）澳门特区政府卫生局疾病预防及控制中心 8 日表示，目前累计有 141 人在本地自助餐厅进食后出现食物中毒症状，其中大部分与进食 “ 问题生蚝 ” 有关。卫生局最早在 3 日公布说，有 14 名来自三个群体的港澳人士 8月 27日至 30日期间在澳门金沙酒店用餐后出现不适，患者陆续出现发热、呕吐和腹泻等类诺沃克样病毒感染的症状。初步调查显示， “ 上述情况可能和进食生蚝有关 ” 。

Loading classifier from E:\workspaces\EclipseEE4.4\aaaaaa\classifiers\chinese.misc.distsim.crf.ser.gz ... done [6.8 sec].

<MISC>2008年 9月 9日新华网 9月 8日</MISC> 信息：（记者 <PERSON>张家伟</PERSON> ） <GPE>澳门</GPE> <LOC>特区</LOC> <ORG>政府卫生局疾病预防及控制中心</ORG> <MISC>8 日</MISC> 表示，目前累计有 141 人在本地自助餐厅进食后出现食物中毒症状，其中大部分与进食 “ 问题生蚝 ” 有关。 <ORG>卫生局</ORG> 最早在 3 日公布说，有 14 名来自 <MISC>三</MISC> 个群体的 <GPE>港澳</GPE> 人士 <MISC>8月 27日至 30日</MISC> 期间在 <GPE>澳门</GPE> 金沙酒店用餐后出现不适，患者陆续出现发热、呕吐和腹泻等类诺沃克样病毒感染的症状。初步调查显示， “ 上述情况可能和进食生蚝有关 ” 。

Complete!
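The inline-XML string returned by `classifyWithInlineXML` is convenient to read but awkward to process further. One possible way to pull the tagged entities out of it is a small regular expression, sketched below with the standard library only. The class name `InlineXmlEntities` is hypothetical, and the pattern assumes tags are uppercase letters and never nested, which matches the output shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineXmlEntities {
    // Matches <TAG>text</TAG> where the closing tag repeats the opening one.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z]+)>(.*?)</\\1>");

    /** Returns "TAG: text" strings in order of appearance. */
    static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        // A fragment of the NER output shown above.
        String sample = "信息：（记者 <PERSON>张家伟</PERSON> ） <GPE>澳门</GPE> <LOC>特区</LOC>";
        // prints:
        // PERSON: 张家伟
        // GPE: 澳门
        // LOC: 特区
        for (String e : extract(sample)) {
            System.out.println(e);
        }
    }
}
```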