Preface
This article looks at how to perform document classification with OpenNLP.
DoccatModel
To classify documents you need a maximum entropy model (Maximum Entropy Model), which in OpenNLP corresponds to DoccatModel.
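As brief background on what a DoccatModel learns (this formula is general maxent background, not taken from the OpenNLP docs): a maximum entropy classifier models the probability of category c given document d as

```latex
P(c \mid d) = \frac{1}{Z(d)} \exp\!\Big(\sum_{i} \lambda_i f_i(c, d)\Big),
\qquad
Z(d) = \sum_{c'} \exp\!\Big(\sum_{i} \lambda_i f_i(c', d)\Big)
```

where the f_i are features (here, essentially word occurrences in the document) and the weights λ_i are fitted by maximizing the log-likelihood of the training data, which is the `loglikelihood` value printed at each training iteration in the output further down.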
import java.io.IOException;
import java.util.Set;
import java.util.SortedMap;

import org.junit.Assert;
import org.junit.Test;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

@Test
public void testSimpleTraining() throws IOException {
    ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
            new DocumentSample("1", new String[]{"a", "b", "c"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
            new DocumentSample("0", new String[]{"x", "y", "z"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 100);
    params.put(TrainingParameters.CUTOFF_PARAM, 0);

    DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
            params, new DoccatFactory());

    DocumentCategorizer doccat = new DocumentCategorizerME(model);
    double[] aProbs = doccat.categorize(new String[]{"a"});
    Assert.assertEquals("1", doccat.getBestCategory(aProbs));
    double[] bProbs = doccat.categorize(new String[]{"x"});
    Assert.assertEquals("0", doccat.getBestCategory(bProbs));

    // make sure the sorted map's last key is category "1", because it has the highest score
    SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
    Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
    Assert.assertEquals(1, cat.size());
}
For ease of testing, the training samples here are hand-written DocumentSample instances.
The categorize method returns an array of probabilities; getBestCategory then picks the best-matching category based on those probabilities.
The training output looks like this:
Indexing events with TwoPass using cutoff of 0
Computing event counts... done. 6 events
Indexing... done.
Sorting and merging events... done. Reduced 6 events to 6.
Done indexing in 0.13 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 6
Number of Outcomes: 2
Number of Predicates: 14
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-4.1588830833596715 0.5
2: ... loglikelihood=-2.6351991759048894 1.0
3: ... loglikelihood=-1.9518912133474995 1.0
4: ... loglikelihood=-1.5599038834410852 1.0
5: ... loglikelihood=-1.3039748361952568 1.0
6: ... loglikelihood=-1.1229511041438864 1.0
7: ... loglikelihood=-0.9877356230661396 1.0
8: ... loglikelihood=-0.8826624290652341 1.0
9: ... loglikelihood=-0.7985244514476817 1.0
10: ... loglikelihood=-0.729543972551105 1.0
//...
95: ... loglikelihood=-0.0933856684859806 1.0
96: ... loglikelihood=-0.09245907503183291 1.0
97: ... loglikelihood=-0.09155090064000486 1.0
98: ... loglikelihood=-0.09066059844628399 1.0
99: ... loglikelihood=-0.08978764309881068 1.0
100: ... loglikelihood=-0.08893152970793908 1.0
Summary
OpenNLP's categorize method requires the input to be tokenized beforehand, which is not very convenient when called on its own. It does make sense for a pipeline-oriented design, though, where tokenization and other steps happen earlier in the pipeline. This article only used the official test source code as an introduction; readers can download a Chinese text-classification training set, train on it, and then classify Chinese text.
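As a minimal sketch of that pipeline idea: tokenize first, then hand the token array to the categorizer. Plain String.split stands in here for a real tokenizer (OpenNLP ships WhitespaceTokenizer and the trainable TokenizerME for this step), and the doccat variable in the commented-out part is assumed to be a trained DocumentCategorizerME like the one built in the test above.

```java
import java.util.Arrays;

public class TokenizeThenCategorize {
    public static void main(String[] args) {
        // Step 1: tokenize. A real pipeline would use WhitespaceTokenizer
        // or a trained TokenizerME here; String.split is just a stand-in.
        String text = "a b c";
        String[] tokens = text.split("\\s+");
        System.out.println(Arrays.toString(tokens)); // [a, b, c]

        // Step 2: categorize the token array (assumes a trained
        // categorizer named doccat, as in the test above):
        // double[] probs = doccat.categorize(tokens);
        // String best = doccat.getBestCategory(probs);
    }
}
```

For Chinese text, step 1 would be replaced by a Chinese word segmenter, since whitespace splitting does not produce meaningful tokens there.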