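In the previous post we saw how the core dictionary, the user-defined dictionary, and the double array are used to segment a string into an array of linked lists (the graph). Here we continue with the getResult(Graph graph) function.
This function comes from the Analysis class, where it is declared abstract. Analysis also contains an abstract class, Merger, with a single abstract method. Subclasses must supply their own implementations: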
protected abstract List<Term> getResult(Graph graph);

public abstract class Merger {
    public abstract List<Term> merger();
}
Below we look at how IndexAnalysis implements this method.
@Override
protected List<Term> getResult(final Graph graph) {
    Merger merger = new Merger() {
        @Override
        public List<Term> merger() {
            graph.walkPath();
            graph.printGraph();
            // Number recognition
            if (MyStaticValue.isNumRecognition && graph.hasNum) {
                NumRecognition.recognitionIndex(graph.terms);
            }
            // Name recognition (older version, commented out in the source)
            /*
             * if (graph.hasPerson && MyStaticValue.isNameRecognition) {
             *     // Asian name recognition
             *     new AsianPersonRecognition(graph.terms).recognition();
             *     graph.walkPathByScore();
             *     NameFix.nameAmbiguity(graph.terms);
             *     // Foreign name recognition
             *     new ForeignPersonRecognition(graph.terms).recognition();
             *     graph.walkPathByScore();
             * }
             */
            // Name recognition
            if (graph.hasPerson && MyStaticValue.isNameRecognition) {
                PersonRecognitionTool.recognition(graph, true, false);
                // Rule-based disambiguation
                NameFix.nameAmbiguity(graph.terms);
            }
            // User-defined dictionary recognition
            // userDefineRecognition(graph, forests);
            return result(graph);
        }
        // (helper methods used above, such as result(graph), are not shown in this excerpt)
    };
    return merger.merger();
}
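The core of the method above is graph.walkPath(). It traverses the graph from front to back and scores every term, then walks from back to front to pick the best-scoring path as the optimal path. Note that the score behaves like a cost: in setPathScore (shown further down) a term keeps the predecessor that gives the smaller score, so "best" here means the lowest cumulative score. The code is: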
public void walkPath() {
    Term term = null;
    // Score from BEGIN first
    merger(root, 0);
    // Score forward, starting from the first word
    for (int i = 0; i < terms.length; i++) {
        term = terms[i];
        while (term != null && term.from() != null && term != end) {
            int to = term.toValue();
            merger(term, to);
            term = term.getNext();
        }
    }
    optimalRoot();
}
merger(root, 0) links the start node into the graph; the loop then walks forward and scores each term. The core method is merger(term, to):
private void merger(Term fromTerm, int to) {
    if (terms.length <= to) return;
    Term term = null;
    if (terms[to] != null) {
        // Score every candidate term that starts at position `to`
        term = terms[to];
        while (term != null) {
            // Relation: to.set(from)
            term.setPathScore(fromTerm);
            term = term.getNext();
        }
    } else {
        // No term starts at `to`: fall back to a single-character term
        char c = chars[to];
        TermNatures tn = DATDictionary.getItem(c).termNatures;
        if (tn == null || tn == TermNatures.NULL) {
            tn = TermNatures.NULL;
        }
        terms[to] = new Term(String.valueOf(c), to, tn);
        terms[to].setPathScore(fromTerm);
    }
}
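The forward pass performed by walkPath and merger can be sketched in isolation. The following is a minimal, self-contained illustration and not Ansj's code: the Node class, the relax method, the toy cost function, and the demo lattice are all hypothetical stand-ins (Node plays the role of Term, and the toy cost simply makes single-character words more expensive than longer ones).

```java
import java.util.ArrayList;
import java.util.List;

class LatticeRelaxDemo {
    /** Hypothetical stand-in for a Term: a candidate word starting at a character offset. */
    static final class Node {
        final String word;
        final int offset;
        Node from;      // best predecessor found so far
        double score;   // cumulative cost of the best path reaching this node

        Node(String word, int offset) {
            this.word = word;
            this.offset = offset;
        }

        /** Offset of the first character after this word. */
        int to() {
            return offset + word.length();
        }

        /** Keep the predecessor that yields the smaller cumulative cost. */
        void relax(Node candidate, double edgeCost) {
            double s = candidate.score + edgeCost;
            if (from == null || s < score) {
                from = candidate;
                score = s;
            }
        }
    }

    /** Toy edge cost (an assumption): END is free, single characters cost 2, longer words cost 1. */
    static double cost(Node from, Node to) {
        if (to.word.isEmpty()) return 0.0;
        return to.word.length() == 1 ? 2.0 : 1.0;
    }

    /** Scores the lattice front to back and returns the END node. */
    static Node forward(List<List<Node>> lattice, Node root) {
        int n = lattice.size() - 1;   // the last slot holds only the END node
        for (Node first : lattice.get(0)) {
            first.relax(root, cost(root, first));
        }
        for (int i = 0; i < n; i++) {
            for (Node node : lattice.get(i)) {
                if (node.from == null) continue;   // not reachable from the root
                for (Node next : lattice.get(node.to())) {
                    next.relax(node, cost(node, next));
                }
            }
        }
        return lattice.get(n).get(0);
    }

    /** Demo lattice for the string "abc" with candidates a, ab, b, c. */
    static List<List<Node>> buildDemo() {
        List<List<Node>> lattice = new ArrayList<>();
        for (int i = 0; i <= 3; i++) lattice.add(new ArrayList<>());
        lattice.get(0).add(new Node("a", 0));
        lattice.get(0).add(new Node("ab", 0));
        lattice.get(1).add(new Node("b", 1));
        lattice.get(2).add(new Node("c", 2));
        lattice.get(3).add(new Node("", 3));   // END
        return lattice;
    }

    public static void main(String[] args) {
        Node end = forward(buildDemo(), new Node("", 0));
        System.out.println("cost " + end.score + ", last word " + end.from.word);
    }
}
```

In this toy lattice the path ab, c is cheaper than a, b, c, so the node for c ends up pointing back to ab rather than to b.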
The key call inside merger(term, to) is public void setPathScore(Term from), which in turn relies on MathUtil.compuScore:
public void setPathScore(Term from) {
    // Build the optimal path with Viterbi
    double score = MathUtil.compuScore(from, this);
    // Keep this predecessor if there is none yet or if it gives a score that is not higher
    if (this.from == null || this.score >= score) {
        this.setFromAndScore(from, score);
    }
}

// In MathUtil
public static double compuScore(Term from, Term to) {
    double frequency = from.termNatures().allFreq + 1;
    if (frequency < 0) {
        double score = from.score() + MAX_FREQUENCE;
        from.score(score);
        return score;
    }
    int nTwoWordsFreq = NgramLibrary.getTwoWordFreq(from, to);
    double value = -Math.log(dSmoothingPara * frequency / (MAX_FREQUENCE + 80000)
            + (1 - dSmoothingPara) * ((1 - dTemp) * nTwoWordsFreq / frequency + dTemp));
    if (value < 0) {
        value += frequency;
    }
    return from.score() + value;
}
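To get a feel for the expression inside compuScore, here is a small standalone sketch that evaluates only the logarithmic term. It is an illustration, not the library's code: the constant values are assumptions (the listing above does not show how MAX_FREQUENCE, dSmoothingPara, and dTemp are defined), and edgeCost is a hypothetical helper name.

```java
class BigramScoreDemo {
    // Assumed values for illustration only; the article does not show the real definitions.
    static final int MAX_FREQUENCE = 2_079_997;
    static final double D_SMOOTHING_PARA = 0.1;
    static final double D_TEMP = 1.0 / MAX_FREQUENCE;

    /**
     * Cost of the edge from -> to, mirroring the expression in compuScore:
     * the negative log of a weighted mix of a unigram term and a bigram term.
     */
    static double edgeCost(int fromFreq, int twoWordsFreq) {
        double frequency = fromFreq + 1;   // add one, as compuScore does
        double unigram = frequency / (MAX_FREQUENCE + 80000);
        double bigram = (1 - D_TEMP) * twoWordsFreq / frequency + D_TEMP;
        return -Math.log(D_SMOOTHING_PARA * unigram + (1 - D_SMOOTHING_PARA) * bigram);
    }

    public static void main(String[] args) {
        // Same preceding word; the pair was never seen together vs. seen together often
        System.out.println("unseen pair:   " + edgeCost(1000, 0));
        System.out.println("frequent pair: " + edgeCost(1000, 500));
    }
}
```

With these assumed constants, a word pair that co-occurs often gets a much smaller cost than a pair that was never seen together, which is why the comparison in setPathScore favors the smaller score.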
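As compuScore shows, the main purpose of all this is to measure how strongly two adjacent words are associated and to score the edge between them. Once scoring is done, protected Term optimalRoot() follows the from pointers back from the end node to recover the best path (the one with the lowest cumulative score, given the comparison in setPathScore) and removes the branches that are not on it. The code is: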
protected Term optimalRoot() {
    Term to = end;
    to.clearScore();
    Term from = null;
    // Walk from back to front along the from pointers to recover the optimal path
    while ((from = to.from()) != null) {
        // Clear every slot strictly between the two terms on the path
        for (int i = from.getOffe() + 1; i < to.getOffe(); i++) {
            terms[i] = null;
        }
        if (from.getOffe() > -1) {
            terms[from.getOffe()] = from;
        }
        // Break the sideways linked list to save memory
        from.setNext(null);
        from.setTo(to);
        from.clearScore();
        to = from;
    }
    return root;
}
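The backtracking step can also be shown on its own. The sketch below is a simplified illustration under assumptions, not Ansj's implementation: Node, bestPath, and prune are hypothetical names, the from pointers are set by hand instead of by a scoring pass, and the root is given offset -1 to match the getOffe() > -1 check above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class BacktraceDemo {
    /** Hypothetical stand-in for a Term that already knows its best predecessor. */
    static final class Node {
        final String word;
        final int offset;
        final Node from;

        Node(String word, int offset, Node from) {
            this.word = word;
            this.offset = offset;
            this.from = from;
        }
    }

    /** Follows the from pointers back from END and returns the words in reading order. */
    static List<String> bestPath(Node end) {
        Deque<String> stack = new ArrayDeque<>();
        for (Node n = end.from; n != null && n.offset >= 0; n = n.from) {
            stack.push(n.word);   // push reverses the back-to-front walk
        }
        return new ArrayList<>(stack);
    }

    /** Returns an array with one slot per character in which only words on the path remain. */
    static String[] prune(Node end, int length) {
        String[] slots = new String[length];
        for (Node n = end.from; n != null && n.offset >= 0; n = n.from) {
            slots[n.offset] = n.word;
        }
        return slots;
    }

    /** Hand-built chain for "abc": root <- ab <- c <- END. */
    static Node demoEnd() {
        Node root = new Node("", -1, null);
        Node ab = new Node("ab", 0, root);
        Node c = new Node("c", 2, ab);
        return new Node("", 3, c);
    }

    public static void main(String[] args) {
        Node end = demoEnd();
        System.out.println(bestPath(end));
        System.out.println(Arrays.toString(prune(end, 3)));
    }
}
```

The array returned by prune plays the role of terms after optimalRoot has run: slots that are not the start of a word on the chosen path stay empty.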
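After optimalRoot() runs, only the optimal path is left in the graph. Going back to the second code listing in this post (getResult in IndexAnalysis), once the optimal path is found the graph can be processed further with number and person-name recognition. Those steps are heavily customized, so I won't go through them here. Thanks for reading, and please point out anything I got wrong.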