(Repost) Chinese Named Entity Recognition with OpenNLP (Part 2: Loading the Model and Recognizing Entities)

Latest recommended article published on 2024-01-17 10:24:50

weixin_34087301

Tags: artificial intelligence, c#, java

Original link: http://www.cnblogs.com/mansiisnam/p/5360783.html

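The previous section showed how to train a named entity recognition model with OpenNLP and write it to disk as a binary bin file. This section loads that model from disk and uses it to recognize named entities. As before, the code comes first: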

import java.io.File;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Map.Entry;

import opennlp.tools.cmdline.namefind.TokenNameFinderModelLoader;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public class NameEntityFindTester {

    // Default probability threshold used to filter recognition results
    private double probThreshold = 0.6;
    private String modelPath;
    private String testFileDirPath;

    public NameEntityFindTester() {
    }

    public NameEntityFindTester(String modelPath, String testFileDirPath) {
        this.modelPath = modelPath;
        this.testFileDirPath = testFileDirPath;
    }

    public NameEntityFindTester(double probThreshold, String modelPath,
            String testFileDirPath) {
        this.probThreshold = probThreshold;
        this.modelPath = modelPath;
        this.testFileDirPath = testFileDirPath;
    }

    /**
     * Creates the NameFinder by loading the model file from disk.
     *
     * @return a name finder backed by the model at modelPath
     */
    public NameFinderME prodNameFinder() {
        NameFinderME finder = new NameFinderME(
                new TokenNameFinderModelLoader().load(new File(modelPath)));
        return finder;
    }

    /**
     * Computes the raw named entities and their probabilities.
     *
     * @param finder
     *            the named entity recognition model
     * @return map from entity text to "type:probability"
     * @throws Exception
     */
    public Map<String, String> cptBasicNameProb(NameFinderME finder)
            throws Exception {
        // IdentityHashMap compares keys by reference, so the same entity text
        // found more than once is kept as separate entries
        Map<String, String> basicNameProbResMap = new IdentityHashMap<String, String>();
        // NameEntityTextFactory is a helper class that is not shown in this section
        String testContent = NameEntityTextFactory
                .loadFileTextDir(this.testFileDirPath);
        // TODO Large texts use a lot of memory; rewrite as batch processing
        // (split one large file into several small files and process them in batches)
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        // Tokens under test, recognition results, probabilities
        String[] tokens = tokenizer.tokenize(testContent);
        Span[] names = finder.find(tokens);
        double[] nameSpanProbs = finder.probs(names);
        System.out.println("tokens size: " + tokens.length);
        System.out.println("names size: " + names.length);
        System.out.println("name_span_probs size: " + nameSpanProbs.length);
        for (int i = 0; i < names.length; i++) {
            // A span covers tokens[start] up to, but not including, tokens[end]
            StringBuilder entityText = new StringBuilder();
            for (int j = names[i].getStart(); j < names[i].getEnd(); j++) {
                entityText.append(tokens[j]);
            }
            String testToken = entityText.toString();
            String testRes = names[i].getType() + ":"
                    + Double.toString(nameSpanProbs[i]);
            // TODO delete print
            System.out.println("find name: \"" + testToken + "\" has res: "
                    + testRes);
            basicNameProbResMap.put(testToken, testRes);
        }
        return basicNameProbResMap;
    }

    /**
     * Filters out recognition results whose probability is too low.
     *
     * @param basicNameProbResMap
     *            map from entity text to "type:probability"
     * @return the entries whose probability is at least probThreshold
     */
    public Map<String, String> filterNameProbRes(
            Map<String, String> basicNameProbResMap) {
        // HashMap compares keys with equals(), so repeated entity text
        // collapses into a single entry here
        Map<String, String> filteredNameProbResMap = new HashMap<String, String>();
        for (Entry<String, String> entry : basicNameProbResMap.entrySet()) {
            String token = entry.getKey();
            String res = entry.getValue();
            // The probability is the part after the last ':'
            double prob = Double.parseDouble(
                    res.substring(res.lastIndexOf(':') + 1));
            if (prob >= this.probThreshold) {
                filteredNameProbResMap.put(token, res);
            }
        }
        return filteredNameProbResMap;
    }

    /**
     * Entry point that runs the whole prediction component.
     *
     * @return the filtered results, or null if recognition failed
     */
    public Map<String, String> execNameFindTester() {
        try {
            NameFinderME finder = this.prodNameFinder();
            Map<String, String> basicNameProbResMap = this
                    .cptBasicNameProb(finder);
            Map<String, String> nameProbResMap = this
                    .filterNameProbRes(basicNameProbResMap);
            return nameProbResMap;
        } catch (Exception e) {
            // The failure is only logged; callers must handle a null result
            e.printStackTrace();
            return null;
        }
    }
}
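The index handling and the choice of map type in NameEntityFindTester can be tried in isolation. The following is only a minimal sketch that runs without OpenNLP: `TokenSpan` is a hypothetical stand-in for `opennlp.tools.util.Span` (start inclusive, end exclusive), and the sentence, entity types and probabilities are made-up illustration data, not output from a real model.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

final class SpanSketch {

    /** Hypothetical stand-in for opennlp.tools.util.Span: start inclusive, end exclusive. */
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins tokens[start] .. tokens[end - 1] into the entity's surface text. */
    static String spanText(String[] tokens, TokenSpan span) {
        StringBuilder sb = new StringBuilder();
        for (int j = span.start; j < span.end; j++) {
            sb.append(tokens[j]);
        }
        return sb.toString(); // always a new String object
    }

    /** Stores "entity text -> type:probability" in whichever map is passed in. */
    static Map<String, String> collect(String[] tokens, TokenSpan[] spans,
            double[] probs, Map<String, String> target) {
        for (int i = 0; i < spans.length; i++) {
            target.put(spanText(tokens, spans[i]), spans[i].type + ":" + probs[i]);
        }
        return target;
    }

    public static void main(String[] args) {
        // Pre-segmented text with a space between words (illustration data)
        String[] tokens = "张三 在 北京 大学 见到 了 张三".split(" ");
        TokenSpan[] spans = {
            new TokenSpan(0, 1, "person"),
            new TokenSpan(2, 4, "organization"), // a two-token entity
            new TokenSpan(6, 7, "person")        // same text as the first entity
        };
        double[] probs = {0.91, 0.74, 0.55};

        System.out.println(spanText(tokens, spans[1])); // 北京大学

        // Keys are compared by reference: both "张三" entries survive
        Map<String, String> byIdentity =
                collect(tokens, spans, probs, new IdentityHashMap<>());
        // Keys are compared with equals(): the second "张三" overwrites the first
        Map<String, String> byEquals =
                collect(tokens, spans, probs, new HashMap<>());

        System.out.println(byIdentity.size()); // 3
        System.out.println(byEquals.size());   // 2
    }
}
```

The two sizes differ only because `spanText` builds a fresh String for every span. If identical keys were interned or reused, an IdentityHashMap would merge them as well, so this way of keeping duplicates depends on how the keys are constructed.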

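As before, the comments in NameEntityFindTester carry much of the detail. The only tunable parameter is the probability threshold used to filter recognized entities. It is optional, because OpenNLP already does some filtering of its own, but if you have stricter requirements you can set one yourself; I default it to 0.6. Here is a brief introduction to the important methods:

1. prodNameFinder() loads the model from disk; you pass it the path of the model file.

2. cptBasicNameProb() is the core method that performs named entity recognition. The result Map is an IdentityHashMap, because the same entity may be assigned more than one category. The method first creates a built-in basic tokenizer. This tokenizer does no real work for Chinese, since it is designed for English, which is why at the beginning we rewrote the Chinese word segmentation results into the text with spaces between the words. The basic tokenizer splits the text at those spaces (SimpleTokenizer also splits wherever the character class changes, for example at punctuation) and puts the pieces into a String array. Passing this array to the finder's find() method yields the recognition result, and the finder's probs() method returns the probability of each result. The result is not the entity words themselves but an array of Span objects, one per recognized entity. Each Span records the start index (inclusive) and end index (exclusive) of the entity's tokens in the String array passed in earlier, because a named entity can be a phrase made up of several words, not just a single word.

The returned Map stores the results in this form: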
